Source: [United Nations]


Assessing Sustainable Development Goals performance worldwide.


1. Objective

In 2015, the United Nations (UN) adopted the 2030 Agenda for Sustainable Development, which comprises the 17 Sustainable Development Goals (SDGs) and aims at a better and more sustainable future for people and the planet. The SDGs address global challenges, including those related to poverty, inequality, climate change, environmental degradation, peace and justice.

Against this background, the objective of this report is to classify countries worldwide into homogeneous groups based on their SDG scores. The aim is to understand the main disparities between countries and to identify the areas in which countries struggle to achieve the goals, in relation to their socioeconomic and political structure. Accordingly, the report analyses whether groups of countries that share similar progress on the 17 SDGs also converge in terms of socioeconomic and political characteristics. After classifying countries into homogeneous groups, each cluster is therefore examined by income level and socioeconomic factors to analyse to what extent the structure of an economy affects the achievement of the SDGs.


The 17 SDGs are listed below:

• SDG 1: No Poverty

• SDG 2: Zero Hunger

• SDG 3: Good Health and Well-being

• SDG 4: Quality Education

• SDG 5: Gender Equality

• SDG 6: Clean Water and Sanitation

• SDG 7: Affordable and Clean Energy

• SDG 8: Decent Work and Economic Growth

• SDG 9: Industry, Innovation and Infrastructure

• SDG 10: Reduced Inequalities

• SDG 11: Sustainable Cities and Communities

• SDG 12: Responsible Consumption and Production

• SDG 13: Climate Action

• SDG 14: Life Below Water

• SDG 15: Life on Land

• SDG 16: Peace, Justice and Strong Institutions

• SDG 17: Partnerships for the Goals

Source: [United Nations]

Libraries

library(tidyverse)     # attaches dplyr, tidyr, readr, ggplot2 and forcats
library(GGally)
library(factoextra)
library(countrycode)   # country name to ISO3 code conversion
library(rworldmap)     # world maps
library(mice)          # imputation of missing values
library(plotly)        # interactive plots
library(readxl)        # read_excel()
library(gplots)
library(cluster)
library(mclust)
library(gridExtra)
library(ggpubr)        # ggarrange()
library(Hmisc)
library(RColorBrewer)  # brewer.pal()


2. Variables of interest

To perform the analysis, data for 152 countries are included in the report. The SDG progress data, corresponding to the 17 goals, are obtained from the Sustainable Development Report 2022, gathered from the SDG Indicators Database. Additionally, some socioeconomic and political indicators are obtained from the World Bank to complement our analysis. In particular, information on GDP per capita, Government Effectiveness and Income level is considered in the report.

2.1 Feature extraction

SDGs data

We begin by loading the information on the progress of the countries, including the score for each of the 17 goals and the Overall Score per country. It should be noted that only the 17 SDG scores will be used for both the PCA and the clustering analysis, while the Overall Score will only be used in the descriptive analysis to provide an overview of the countries’ overall progress in terms of sustainable development.

report2022 <- read.csv("Sustainable_Development_Report_2022.csv")
report2022<- report2022 %>% 
  dplyr::select(Name, ID, Overall_Score, starts_with("Goal_") & ends_with("Score"))

head(report2022)
##      Name  ID Overall_Score Goal_1_Score Goal_2_Score Goal_3_Score Goal_4_Score
## 1 Finland FIN      86.50874      99.8215     64.17350     94.71350     98.05167
## 2 Denmark DNK      85.63330      99.6960     66.39612     95.42814     97.73400
## 3  Sweden SWE      85.18928      98.8810     63.41087     95.71892     99.87567
## 4  Norway NOR      82.34929      99.5130     60.37638     97.24957     97.60700
## 5 Austria AUT      82.31520      99.3415     73.70100     91.94857     98.24167
## 6 Germany DEU      82.17874      99.5335     72.57800     93.77714     97.29967
##   Goal_5_Score Goal_6_Score Goal_7_Score Goal_8_Score Goal_9_Score
## 1     91.07725      93.6282     89.02200     87.66233     94.41333
## 2     86.80300      89.8198     88.13250     88.88167     96.40733
## 3     90.91900      95.0576     93.28975     83.86317     97.32617
## 4     90.39525      84.9070     96.76025     83.79250     91.33417
## 5     82.85667      92.3754     85.21000     84.04100     95.67200
## 6     80.49150      88.5496     76.59475     86.91250     93.43133
##   Goal_10_Score Goal_11_Score Goal_12_Score Goal_13_Score Goal_14_Score
## 1       98.4375      92.04550      70.24829      60.22333      85.12733
## 2       98.3890      95.06775      54.80729      58.53600      71.32233
## 3       93.3540      92.01250      63.09129      60.23900      67.26467
## 4       99.8590      94.00575      50.77643      20.64167      73.90250
## 5       93.7895      93.01467      56.78743      55.30133            NA
## 6       89.1155      90.90825      59.36029      55.58100      67.64233
##   Goal_15_Score Goal_16_Score Goal_17_Score
## 1       84.9884       94.1074      72.90750
## 2       92.8168       93.2551      82.27325
## 3       80.1226       86.5778      87.21375
## 4       73.7214       90.4535      94.64250
## 5       73.4606       91.2038      68.70000
## 6       79.1014       84.1893      81.97250


World Development Indicators

Data on GDP per capita and Government Effectiveness is also incorporated for each country for the year 2022. Including information about GDP can provide insights into the economic structure of different countries, allowing us to assess the economic development levels across nations. Furthermore, considering government effectiveness data is crucial for understanding how governance quality of countries may impact their ability to achieve sustainable development goals.

indicators <- read_excel("World_Development_Indicators.xlsx",
  col_types = c("text", "text", "numeric", "numeric"))

indicators <- indicators %>% 
  dplyr::rename(Country_Name = `Country Name`, Country_Code = `Country Code`)

str(indicators)
## tibble [217 × 4] (S3: tbl_df/tbl/data.frame)
##  $ Country_Name           : chr [1:217] "Afghanistan" "Albania" "Algeria" "American Samoa" ...
##  $ Country_Code           : chr [1:217] "AFG" "ALB" "DZA" "ASM" ...
##  $ GovernmentEffectiveness: num [1:217] -1.8796 0.0651 -0.5131 0.6679 1.4953 ...
##  $ GDPpercapita           : num [1:217] NA 6810 4343 19673 41993 ...
head(indicators)
## # A tibble: 6 × 4
##   Country_Name   Country_Code GovernmentEffectiveness GDPpercapita
##   <chr>          <chr>                          <dbl>        <dbl>
## 1 Afghanistan    AFG                          -1.88            NA 
## 2 Albania        ALB                           0.0651        6810.
## 3 Algeria        DZA                          -0.513         4343.
## 4 American Samoa ASM                           0.668        19673.
## 5 Andorra        AND                           1.50         41993.
## 6 Angola         AGO                          -1.04          3000.


Income Classification

Finally, we incorporate The World Bank’s country classification based on four income groups: low, lower-middle, upper-middle, and high income. This classification provides additional information to our analysis by categorizing countries according to their income levels, which can provide context about the economic structure and development of each nation.

class <- read_excel("class.xlsx")
class <- class %>% 
  dplyr::rename(Income_group = `Income group`)


2.2 Preparing and cleaning the data

Once we have obtained the datasets containing all the variables, we need to clean and apply some transformation in order to prepare the data for the analysis.


Handling missing values

An essential step is to address missing values in our dataset. We begin by examining countries with missing values in the ‘Overall Score’ variable: for these countries the scores of the 17 individual goals are missing as well, owing to the small size of these countries, so these rows are removed.

# analyse missing values
sapply(report2022, function(x) sum(is.na(x))*100/nrow(report2022)) 
##          Name            ID Overall_Score  Goal_1_Score  Goal_2_Score 
##       0.00000       0.00000      19.68912      24.87047      19.68912 
##  Goal_3_Score  Goal_4_Score  Goal_5_Score  Goal_6_Score  Goal_7_Score 
##      19.68912      20.20725      19.68912      19.68912      19.68912 
##  Goal_8_Score  Goal_9_Score Goal_10_Score Goal_11_Score Goal_12_Score 
##      19.68912      19.68912      27.97927      19.68912      19.68912 
## Goal_13_Score Goal_14_Score Goal_15_Score Goal_16_Score Goal_17_Score 
##      19.68912      40.41451      19.68912      19.68912      19.68912
na_counts <- colSums(is.na(report2022)) 
print(na_counts)
##          Name            ID Overall_Score  Goal_1_Score  Goal_2_Score 
##             0             0            38            48            38 
##  Goal_3_Score  Goal_4_Score  Goal_5_Score  Goal_6_Score  Goal_7_Score 
##            38            39            38            38            38 
##  Goal_8_Score  Goal_9_Score Goal_10_Score Goal_11_Score Goal_12_Score 
##            38            38            54            38            38 
## Goal_13_Score Goal_14_Score Goal_15_Score Goal_16_Score Goal_17_Score 
##            38            78            38            38            38
columns_with_na <- names(na_counts[na_counts > 0])  
print(columns_with_na)
##  [1] "Overall_Score" "Goal_1_Score"  "Goal_2_Score"  "Goal_3_Score" 
##  [5] "Goal_4_Score"  "Goal_5_Score"  "Goal_6_Score"  "Goal_7_Score" 
##  [9] "Goal_8_Score"  "Goal_9_Score"  "Goal_10_Score" "Goal_11_Score"
## [13] "Goal_12_Score" "Goal_13_Score" "Goal_14_Score" "Goal_15_Score"
## [17] "Goal_16_Score" "Goal_17_Score"
# remove missing Overall Scores
report2022 <- report2022[!is.na(report2022$Overall_Score), , drop = FALSE]
na_counts <- colSums(is.na(report2022))
print(na_counts)
##          Name            ID Overall_Score  Goal_1_Score  Goal_2_Score 
##             0             0             0            10             0 
##  Goal_3_Score  Goal_4_Score  Goal_5_Score  Goal_6_Score  Goal_7_Score 
##             0             1             0             0             0 
##  Goal_8_Score  Goal_9_Score Goal_10_Score Goal_11_Score Goal_12_Score 
##             0             0            16             0             0 
## Goal_13_Score Goal_14_Score Goal_15_Score Goal_16_Score Goal_17_Score 
##             0            40             0             0             0
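The two-stage check above can be illustrated on a toy table. The following is a minimal sketch with made-up values (the data frame `toy` and its contents are hypothetical and not taken from the report). It uses `colMeans(is.na(...))` as a shorthand for the `sapply` call and then drops the rows without an overall score.

```r
# Toy data: hypothetical values that only mimic the structure of report2022
toy <- data.frame(
  Name          = c("A", "B", "C", "D"),
  Overall_Score = c(80.1, NA, 65.3, 70.2),
  Goal_1_Score  = c(99.0, NA, NA, 88.5)
)

# Share of missing values per column, in percent
na_share <- colMeans(is.na(toy)) * 100
print(na_share)

# Keep only the rows that have an overall score
toy_clean <- toy[!is.na(toy$Overall_Score), , drop = FALSE]

# Goal-specific gaps can remain after this step and need separate treatment
print(colSums(is.na(toy_clean)))
```

In this toy case one goal score is still missing after the filter, which mirrors the situation of Goals 1, 4, 10 and 14 in the output above.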


Automatic imputation of NAs

Additionally, we observe missing values for specific goals (Goals 1, 4, 10 and 14). In this case, we impute the NAs automatically with the mice package, using its random forest method ('rf'), so that the missing scores are replaced with predictions.

m = 4  
mice_mod <- mice(report2022, m=m, method='rf') 
## 
##  iter imp variable
##   1   1  Goal_1_Score  Goal_4_Score  Goal_10_Score  Goal_14_Score
##   1   2  Goal_1_Score  Goal_4_Score  Goal_10_Score  Goal_14_Score
##   1   3  Goal_1_Score  Goal_4_Score  Goal_10_Score  Goal_14_Score
##   1   4  Goal_1_Score  Goal_4_Score  Goal_10_Score  Goal_14_Score
##   2   1  Goal_1_Score  Goal_4_Score  Goal_10_Score  Goal_14_Score
##   2   2  Goal_1_Score  Goal_4_Score  Goal_10_Score  Goal_14_Score
##   2   3  Goal_1_Score  Goal_4_Score  Goal_10_Score  Goal_14_Score
##   2   4  Goal_1_Score  Goal_4_Score  Goal_10_Score  Goal_14_Score
##   3   1  Goal_1_Score  Goal_4_Score  Goal_10_Score  Goal_14_Score
##   3   2  Goal_1_Score  Goal_4_Score  Goal_10_Score  Goal_14_Score
##   3   3  Goal_1_Score  Goal_4_Score  Goal_10_Score  Goal_14_Score
##   3   4  Goal_1_Score  Goal_4_Score  Goal_10_Score  Goal_14_Score
##   4   1  Goal_1_Score  Goal_4_Score  Goal_10_Score  Goal_14_Score
##   4   2  Goal_1_Score  Goal_4_Score  Goal_10_Score  Goal_14_Score
##   4   3  Goal_1_Score  Goal_4_Score  Goal_10_Score  Goal_14_Score
##   4   4  Goal_1_Score  Goal_4_Score  Goal_10_Score  Goal_14_Score
##   5   1  Goal_1_Score  Goal_4_Score  Goal_10_Score  Goal_14_Score
##   5   2  Goal_1_Score  Goal_4_Score  Goal_10_Score  Goal_14_Score
##   5   3  Goal_1_Score  Goal_4_Score  Goal_10_Score  Goal_14_Score
##   5   4  Goal_1_Score  Goal_4_Score  Goal_10_Score  Goal_14_Score
report2022 <- complete(mice_mod, action=m)    
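Because mice draws its imputations at random, the imputed scores change from run to run unless a seed is fixed. The sketch below shows one way to make the step reproducible. It is an illustration under assumptions, not the report's own code: it uses the built-in `airquality` data instead of the SDG table, the default `pmm` method instead of `rf` (to avoid an extra package dependency), and an arbitrary seed value.

```r
library(mice)

# airquality ships with R and has missing values in Ozone and Solar.R
imp_a <- mice(airquality, m = 4, method = "pmm", seed = 2022, printFlag = FALSE)
imp_b <- mice(airquality, m = 4, method = "pmm", seed = 2022, printFlag = FALSE)

# The namespace prefix avoids any clash with tidyr::complete()
# action = 4 returns the fourth of the m = 4 imputed data sets
done_a <- mice::complete(imp_a, action = 4)
done_b <- mice::complete(imp_b, action = 4)

# Both runs share the seed, so the completed tables should match
print(identical(done_a, done_b))

# No missing values should remain after completion
print(anyNA(done_a))
```

Note that `action = m` keeps only one of the imputed data sets; the other imputations are generated but not used downstream.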



Finally, we apply the same procedure and replace the missing values of GDP per capita and Government Effectiveness with predictions.

m = 4 
mice_mod <- mice(indicators, m=m, method='rf') 
## 
##  iter imp variable
##   1   1  GovernmentEffectiveness  GDPpercapita
##   1   2  GovernmentEffectiveness  GDPpercapita
##   1   3  GovernmentEffectiveness  GDPpercapita
##   1   4  GovernmentEffectiveness  GDPpercapita
##   2   1  GovernmentEffectiveness  GDPpercapita
##   2   2  GovernmentEffectiveness  GDPpercapita
##   2   3  GovernmentEffectiveness  GDPpercapita
##   2   4  GovernmentEffectiveness  GDPpercapita
##   3   1  GovernmentEffectiveness  GDPpercapita
##   3   2  GovernmentEffectiveness  GDPpercapita
##   3   3  GovernmentEffectiveness  GDPpercapita
##   3   4  GovernmentEffectiveness  GDPpercapita
##   4   1  GovernmentEffectiveness  GDPpercapita
##   4   2  GovernmentEffectiveness  GDPpercapita
##   4   3  GovernmentEffectiveness  GDPpercapita
##   4   4  GovernmentEffectiveness  GDPpercapita
##   5   1  GovernmentEffectiveness  GDPpercapita
##   5   2  GovernmentEffectiveness  GDPpercapita
##   5   3  GovernmentEffectiveness  GDPpercapita
##   5   4  GovernmentEffectiveness  GDPpercapita
indicators <- complete(mice_mod, action=m)    
summary(indicators)
##  Country_Name       Country_Code       GovernmentEffectiveness  GDPpercapita   
##  Length:217         Length:217         Min.   :-2.38987        Min.   :   259  
##  Class :character   Class :character   1st Qu.:-0.74527        1st Qu.:  2255  
##  Mode  :character   Mode  :character   Median :-0.09709        Median :  6984  
##                                        Mean   :-0.04102        Mean   : 19095  
##                                        3rd Qu.: 0.65083        3rd Qu.: 25057  
##                                        Max.   : 2.14483        Max.   :240862


Merging the data

After handling the missing values in the datasets containing our variables of interest (SDG scores, socioeconomic indicators, and country income group information), we merge the data into a single dataset.

merged_data <- indicators %>%
 inner_join(class, by = c("Country_Code"="Code"))
merged_data <- merged_data %>% 
  inner_join(report2022,  by = c("Country_Name"="Name"))
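An inner join silently drops rows that have no partner. The SDG table keeps 155 countries after the 38 without an overall score are removed (38 is 19.69% of 193 rows), while the merged table has 152 rows, so a few countries appear to be lost in one of the two joins. A minimal sketch of how such losses could be inspected with `anti_join` follows; the toy tables and their values are hypothetical and only illustrate a spelling mismatch in country names.

```r
library(dplyr)

# Hypothetical miniature versions of the two tables (scores are placeholders)
indicators_toy <- data.frame(
  Country_Name = c("Finland", "Turkiye", "Egypt, Arab Rep."),
  Country_Code = c("FIN", "TUR", "EGY")
)
report_toy <- data.frame(
  Name          = c("Finland", "Turkey", "Egypt, Arab Rep."),
  ID            = c("FIN", "TUR", "EGY"),
  Overall_Score = c(80, 70, 60)
)

# Rows of the report that find no partner when matching on country names
unmatched_by_name <- anti_join(report_toy, indicators_toy,
                               by = c("Name" = "Country_Name"))

# Matching on ISO3 codes is not affected by spelling differences
unmatched_by_code <- anti_join(report_toy, indicators_toy,
                               by = c("ID" = "Country_Code"))

print(unmatched_by_name$Name)
print(nrow(unmatched_by_code))
```

Since `report2022` carries an `ID` column with ISO3 codes, joining on `Country_Code` and `ID` would be one option to consider if name mismatches turn out to be the cause.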


Discarding irrelevant data

We begin by removing unnecessary columns to reduce the dimensions of our dataset and eliminate irrelevant information.

merged_data <- merged_data %>%
 dplyr::select(-"Lending category", -"Economy", -"ID")


Next, we examine our data to ensure that the variables are in the correct format.

str(merged_data)
## 'data.frame':    152 obs. of  24 variables:
##  $ Country_Name           : chr  "Afghanistan" "Albania" "Algeria" "Angola" ...
##  $ Country_Code           : chr  "AFG" "ALB" "DZA" "AGO" ...
##  $ GovernmentEffectiveness: num  -1.8796 0.0651 -0.5131 -1.0404 -0.2829 ...
##  $ GDPpercapita           : num  650 6810 4343 3000 13651 ...
##  $ Region                 : chr  "South Asia" "Europe & Central Asia" "Middle East & North Africa" "Sub-Saharan Africa" ...
##  $ Income_group           : chr  "Low income" "Upper middle income" "Lower middle income" "Lower middle income" ...
##  $ Overall_Score          : num  52.5 71.6 71.5 50.9 72.8 ...
##  $ Goal_1_Score           : num  11.8 94.3 97.4 12.9 96.6 ...
##  $ Goal_2_Score           : num  51.6 59.9 58.4 55.8 67.5 ...
##  $ Goal_3_Score           : num  38.1 82.9 76.1 34.8 79.2 ...
##  $ Goal_4_Score           : num  34.4 94.3 87.7 42.2 97.3 ...
##  $ Goal_5_Score           : num  21.7 53.2 53.4 50.3 81.2 ...
##  $ Goal_6_Score           : num  50.4 74.3 60.4 54.3 79.1 ...
##  $ Goal_7_Score           : num  44.1 81.3 65.3 63.6 72.4 ...
##  $ Goal_8_Score           : num  33.8 59.1 60.9 52.9 65.7 ...
##  $ Goal_9_Score           : num  7.44 31.11 46.58 11.25 48.45 ...
##  $ Goal_10_Score          : num  75.5 80.3 97 16.5 43.7 ...
##  $ Goal_11_Score          : num  29.3 74.5 57.8 47.6 82.1 ...
##  $ Goal_12_Score          : num  97.7 86.8 91.4 95.1 82.7 ...
##  $ Goal_13_Score          : num  98.8 88.5 88.6 96.8 88.2 ...
##  $ Goal_14_Score          : num  51.3 42.8 63.7 68.3 63.3 ...
##  $ Goal_15_Score          : num  52.9 80 69.9 66.5 61.2 ...
##  $ Goal_16_Score          : num  49.2 68.7 72.4 49 65.4 ...
##  $ Goal_17_Score          : num  42.9 65.7 69.3 48.3 63.2 ...


Since scores and country indicators should be treated as numeric, we verify their formatting accordingly. Additionally, we convert the variable ‘Income_group’ into a factor variable.

merged_data <- merged_data %>% 
  mutate(Income_group = fct_relevel(Income_group,"Low income","Lower middle income", "Upper middle income", "High income" ))
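The explicit level order matters because R sorts factor levels alphabetically by default, which would place "High income" before "Low income" in plots and tables. The sketch below shows the effect on a small hypothetical vector, using base R's `factor()` as a stand-in for `fct_relevel`.

```r
# Hypothetical income labels for five countries
income <- c("High income", "Low income", "Upper middle income",
            "Lower middle income", "High income")

# Default behaviour: levels are sorted alphabetically
default_levels <- levels(factor(income))
print(default_levels)

# Explicit order from lowest to highest income
income_f <- factor(income,
                   levels = c("Low income", "Lower middle income",
                              "Upper middle income", "High income"))
print(levels(income_f))

# Counts now follow the economic ordering
print(table(income_f))
```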


Following these steps, we have obtained a dataset prepared for analysis containing 24 variables and 152 observations, each representing a country.

dim(merged_data)
## [1] 152  24


3. Pre-process data and descriptive analysis


3.1 Numerical measures

  • Mean, Median, Standard Deviation, 1st & 3rd Quartile

In order to gain a deeper insight into our data, we start by computing some statistics for our variables. From the results displayed below, the maximum GDP per capita is 125,006 and the minimum is 259, indicating that at least one observation has an extremely high value and another a notably low value. Additionally, the third quartile indicates that 75% of the countries have a GDP per capita equal to or lower than 21,387, far below the maximum. These measures suggest that GDP per capita varies widely across observations and that outliers are likely present.

data_summary <- merged_data %>%
  summarise(
    avg_GDPpercapita = mean(GDPpercapita),
    median_GDPpercapita = median(GDPpercapita),
    max_GDPpercapita = max(GDPpercapita),
    min_GDPpercapita = min(GDPpercapita),
    sd_GDPpercapita = sd(GDPpercapita),
    Q1_GDPpercapita = quantile(GDPpercapita, probs = 0.25),
    Q3_GDPpercapita = quantile(GDPpercapita, probs = 0.75),
    avg_GovernmentEffectiveness = mean(GovernmentEffectiveness),
    median_GovernmentEffectiveness = median(GovernmentEffectiveness),
    max_GovernmentEffectiveness = max(GovernmentEffectiveness),
    min_GovernmentEffectiveness = min(GovernmentEffectiveness),
    sd_GovernmentEffectiveness = sd(GovernmentEffectiveness),
    Q1_GovernmentEffectiveness = quantile(GovernmentEffectiveness, probs = 0.25),
    Q3_GovernmentEffectiveness = quantile(GovernmentEffectiveness, probs = 0.75),
    avg_Overall_Score = mean(Overall_Score),
    median_Overall_Score = median(Overall_Score),
    max_Overall_Score = max(Overall_Score),
    min_Overall_Score = min(Overall_Score),
    sd_Overall_Score = sd(Overall_Score),
    Q1_Overall_Score = quantile(Overall_Score, probs = 0.25),
    Q3_Overall_Score = quantile(Overall_Score, probs = 0.75)
  ) %>%
  pivot_longer(
    cols = everything(),
    names_to = c(".value", "variable"),
    names_sep = "_"
  )

data_summary
## # A tibble: 3 × 8
##   variable                   avg   median    max    min      sd       Q1      Q3
##   <chr>                    <dbl>    <dbl>  <dbl>  <dbl>   <dbl>    <dbl>   <dbl>
## 1 GDPpercapita           1.77e+4 6741.    1.25e5 259.   2.44e+4 2147.    2.14e+4
## 2 GovernmentEffectiven… -4.59e-2   -0.133 2.14e0  -2.39 9.76e-1   -0.748 5.95e-1
## 3 Overall                6.73e+1   69.3   8.65e1  39.0  1.02e+1   60.1   7.46e+1
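In the table above the third row is labelled `Overall` rather than `Overall_Score`. This is consistent with `names_sep = "_"` splitting names such as `avg_Overall_Score` at every underscore and discarding the extra piece. A more compact variant, sketched below on hypothetical toy values, builds the statistics with `across()` and splits only at the first underscore via `names_pattern`. The data frame `toy` and the reduced set of statistics are assumptions made for illustration.

```r
library(dplyr)
library(tidyr)

# Hypothetical values; only the column names follow the report
toy <- data.frame(
  GDPpercapita  = c(300, 2000, 7000, 21000, 120000),
  Overall_Score = c(45, 58, 66, 72, 85)
)

toy_summary <- toy %>%
  summarise(across(everything(),
                   list(avg = mean, median = median, sd = sd),
                   .names = "{.fn}_{.col}")) %>%
  pivot_longer(
    cols = everything(),
    names_to = c(".value", "variable"),
    # first group: statistic name, second group: full variable name
    names_pattern = "^([^_]+)_(.*)$"
  )

print(toy_summary)
```

With this pattern the variable column keeps the complete name, including any underscores it contains.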


3.2 Scale

Based on the previous results, we apply a logarithmic transformation to the variable GDP per capita. This transformation is aimed at addressing outliers and the highly asymmetric distribution observed in the data. By taking the logarithm of GDP per capita values, we aim to normalize the distribution and reduce the impact of extreme values, which can distort statistical analyses and modeling techniques.

merged_data <- merged_data %>% 
  mutate(log_GDP = log(GDPpercapita))  %>%
  relocate(log_GDP, .after = GDPpercapita)


merged_data %>%
  dplyr::select(Country_Name, log_GDP, Overall_Score) %>%
  summarise(max_GDPpercapita = max(log_GDP),
            country_with_max_GDP = Country_Name[which.max(log_GDP)],
            Overall_Score = Overall_Score[which.max(log_GDP)])
##   max_GDPpercapita country_with_max_GDP Overall_Score
## 1         11.73612           Luxembourg      75.74422
merged_data %>%
  dplyr::select(Country_Name, log_GDP, Overall_Score) %>%
  summarise(min_GDPpercapita = min(log_GDP),
            country_with_min_GDP = Country_Name[which.min(log_GDP)],
            Overall_Score = Overall_Score[which.min(log_GDP)])
##   min_GDPpercapita country_with_min_GDP Overall_Score
## 1         5.556925              Burundi       54.0531
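As a quick arithmetic check of what the transformation does to the extremes, the sketch below uses the rounded minimum and maximum quoted earlier (259 and 125,006). On the original scale the richest country has several hundred times the GDP per capita of the poorest, whereas on the log scale the two values are only about six units apart.

```r
# Rounded extremes of GDP per capita as quoted in the text
gdp_range <- c(min = 259, max = 125006)

# Natural logarithms, comparable to the log_GDP values reported above
log_range <- log(gdp_range)
print(log_range)

# Ratio on the original scale versus difference on the log scale
ratio_raw <- gdp_range[["max"]] / gdp_range[["min"]]
diff_log  <- log_range[["max"]] - log_range[["min"]]
print(ratio_raw)
print(diff_log)
```

The difference on the log scale equals the logarithm of the ratio, which is why multiplicative gaps between countries become additive after the transformation.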


3.3 Distribution of variables


Socioeconomic variables

The dataset contains 21 variables of interest (17 SDG scores, Overall Score, GDP per capita, Government Effectiveness and Income Group). To get an idea of the main characteristics of the data, we focus first on the socioeconomic variables, which help us characterise the countries. We plot the distribution of GDP per capita, Government Effectiveness and Income Group: quantitative variables are displayed in histograms, while qualitative features are represented in bar plots.


  • Histograms for GDP per capita and Government Effectiveness:

In the plot below, we can observe the right-skewed distribution of GDP per capita, confirming our previous conclusion that the variable has a highly asymmetric distribution. This skewness suggests that while the majority of countries have relatively low GDP per capita values, a few countries have extremely high values, leading to a long tail on the right side. To address this skewness, we apply a logarithmic scale to the GDP per capita values and plot the distribution again, which yields a more symmetric and smoother distribution across countries.

Lastly, we illustrate the distribution of Government Effectiveness, which is an important indicator of the quality of governance within countries. The symmetric distribution suggests a relatively balanced distribution of governance quality across countries, with most nations falling within a similar range of effectiveness scores.

# Each histogram shows the share of countries per bin (count / total count)
box_gdp <- 
  ggplot(merged_data, mapping = aes(x = GDPpercapita)) + 
  geom_histogram(bins = 15, fill = "#756bb1",
                 aes(y = after_stat(count / sum(count)))) +
  labs(y = "Proportion")


box_gdp_log <-  
  ggplot(merged_data, aes(x = log(GDPpercapita))) +
  geom_histogram(bins = 15, fill = "#bcbddc", 
                 aes(y = after_stat(count / sum(count)))) +
  labs(x = "log(GDP per capita)", y = "Proportion", title = "Histogram of log(GDP per capita)")


box_govn <- 
  ggplot(merged_data, mapping = aes(x = GovernmentEffectiveness)) +
  geom_histogram(bins = 15, fill = "#c994c7",
                 aes(y = after_stat(count / sum(count)))) +
  labs(y = "Proportion")


ggarrange(box_gdp, box_gdp_log, box_govn,
          ncol = 3, nrow = 1)


  • Bar plot for Income group distribution:

The bar plot below indicates that our data contain a relatively high number of observations classified as high-income countries, while the number of low-income countries is relatively low. The two middle-income groups are quite balanced in our data.

color_palette <- brewer.pal(n = length(unique(merged_data$Income_group)), name = "Set2")

box_group <- merged_data %>% ggplot(aes(x = reorder(Income_group, Income_group, length))) + 
  geom_bar(aes(fill = Income_group)) +
  labs(caption = "Countries per Income group",
       x = "", y = "") +
  theme(legend.position = "none") +
  scale_fill_manual(values = color_palette) 


box_group


  • Bar plot for region distribution:

In the following bar plot we observe the number of observations located in each of the 7 world regions. Europe & Central Asia is the region containing the most countries, while North America and South Asia contain very few.

color_palette <- brewer.pal(n = length(unique(merged_data$Region)), name = "Set2")

region_box <- merged_data %>% ggplot(aes(x=reorder(Region, Region, length))) + 
  geom_bar(aes(fill=Region)) +
  labs(caption="Countries per Region",
       x = "", y = "")+ 
  theme(legend.position="none") +
  scale_fill_manual(values = color_palette) 

region_box


  • Worldwide Government Effectiveness

In the map below, a clear pattern emerges: more developed countries, particularly those in Europe, North America, and Australia, exhibit higher levels of government quality. In contrast, less developed and less wealthy countries, such as those in Latin America or Africa, demonstrate lower government effectiveness.

map = merged_data %>% dplyr::select(Country_Name, GovernmentEffectiveness)


map$country = countrycode(map$Country_Name, 'country.name', 'iso3c')


matched <- joinCountryData2Map(map, joinCode = "ISO3",
                               nameJoinColumn = "country")
## 152 codes from your data successfully matched countries in the map
## 0 codes from your data failed to match with a country code in the map
## 91 codes from the map weren't represented in your data
mapCountryData(matched,nameColumnToPlot="GovernmentEffectiveness",missingCountryCol = "white",
               borderCol = "#C7D9FF",
               catMethod = "pretty", colourPalette = "topo",
               mapTitle = c("Government Effectiveness by Country"), lwd=1)


  • Worldwide GDP per capita

In relation to GDP, a clear distinction between regions in terms of wealth is evident. In this context, wealthier countries are primarily located in Europe, North America, Australia, and New Zealand, followed by Asia and South America. Conversely, poorer countries, as expected, are concentrated in developing regions, such as Africa and parts of Asia.

map = merged_data %>% dplyr::select(Country_Name, log_GDP)


map$country = countrycode(map$Country_Name, 'country.name', 'iso3c')


matched <- joinCountryData2Map(map, joinCode = "ISO3",
                               nameJoinColumn = "country")
## 152 codes from your data successfully matched countries in the map
## 0 codes from your data failed to match with a country code in the map
## 91 codes from the map weren't represented in your data
mapCountryData(matched,nameColumnToPlot="log_GDP",missingCountryCol = "white",
               borderCol = "#C7D9FF",
               catMethod = "pretty", colourPalette = "topo",
               mapTitle = c("log(GDP per capita) by Country"), lwd=1)


Sustainable development variables

In order to visualize the distribution of the Sustainable Development Goals assessment, we create histograms for each of the 17 scores and the Overall Score, allowing us to explore the performance of the countries.


  • Histogram for Overall score:

The distribution of the overall score across countries suggests two loosely defined patterns. The main part of the plot shows a roughly normal and symmetric distribution, while on the left side we can notice a small group of countries characterized by extremely low overall scores.

plot_score <- 
  ggplot(merged_data, mapping = aes(x = Overall_Score)) +
  geom_histogram(bins = 15, fill = "#7fcdbb",
                 aes(y = after_stat(count / sum(count))))

print(plot_score)

  • Histogram for the 17 goals scores:

By observing the distributions of each of the 17 Sustainable Development Goals scores, we notice that, on the one hand, Goals 2, 3, 5, 6, 8, 9, 10, 11, 14, 15, 16 and 17 show a roughly normal and symmetric distribution. On the other hand, Goals 1, 4, 7, 12 and 13 show a left-skewed distribution, implying that most countries have high values while a few have extremely low values. These findings point to varying levels of achievement across the different sustainable development targets and highlight areas where further efforts may be needed to address disparities and improve overall global progress.

# Helper: histogram of one score column, with the share of countries on the y-axis
goal_hist <- function(column) {
  ggplot(merged_data, aes(x = .data[[column]])) +
    geom_histogram(bins = 15, fill = "#7fcdbb",
                   aes(y = after_stat(count / sum(count)))) +
    labs(x = column, y = "Proportion")
}

box_1  <- goal_hist("Goal_1_Score")
box_2  <- goal_hist("Goal_2_Score")
box_3  <- goal_hist("Goal_3_Score")
box_4  <- goal_hist("Goal_4_Score")
box_5  <- goal_hist("Goal_5_Score")
box_6  <- goal_hist("Goal_6_Score")
box_7  <- goal_hist("Goal_7_Score")
box_8  <- goal_hist("Goal_8_Score")
box_9  <- goal_hist("Goal_9_Score")
box_10 <- goal_hist("Goal_10_Score")
box_11 <- goal_hist("Goal_11_Score")
box_12 <- goal_hist("Goal_12_Score")
box_13 <- goal_hist("Goal_13_Score")
box_14 <- goal_hist("Goal_14_Score")
box_15 <- goal_hist("Goal_15_Score")
box_16 <- goal_hist("Goal_16_Score")
box_17 <- goal_hist("Goal_17_Score")

ggarrange(box_1, box_2, box_3, 
          box_4,box_5, box_6, 
          box_7, box_8, box_9,
          ncol = 3, nrow = 3)

ggarrange(box_10, box_11, 
          box_12,box_13,
          ncol = 2, nrow = 2)

ggarrange(box_14,box_15, box_16, box_17,
          ncol = 2, nrow = 2)


3.4 Bivariate descriptive analysis

In order to analyse the behaviour of our variables in relation to the others, a bivariate analysis is conducted, where we can observe how the variables interact.


Conditional distribution of variables

The conditional distribution of pairs of variables can be plotted to visualize the dispersion of the variables and to observe the behaviour of one variable in terms of the other. By doing so, we can explore whether government quality or wealth levels are determinants of sustainable development.


  • Government Effectiveness vs Overall Sustainability Score

In the following interactive plot we can observe the overall SDG score in relation to government effectiveness. The results show an increasing pattern: countries with higher government quality tend to achieve higher levels of sustainability progress.

p = ggplot(merged_data, aes(x=GovernmentEffectiveness, y=Overall_Score, group=Region, size=Overall_Score, color=GDPpercapita, text=Country_Name)) + geom_point(alpha=0.9) + 
   geom_point(alpha = 0.9, size = 3) +
  facet_wrap(~ Region) +
  scale_color_gradient(low="lightblue", high="darkblue") +
  theme_minimal()+ theme(legend.position="none") + 
  labs(title = "World countries: Overall Score vs Government Effectiveness", subtitle="(color denotes GDP)",
       x = "GovernmentEffectiveness", y = "Overall Score")

ggplotly(p, tooltip=c("Country_Name"))


  • Economic development (GDP) vs Overall Sustainability Score

Once again, the plot reveals a positive association between wealth and the achievement of the sustainability goals: the overall SDG score increases with GDP per capita (shown in logs).

# Fixed-size points: colour encodes Government Effectiveness, the x axis is GDP per capita in logs
p = ggplot(merged_data, aes(x=log_GDP, y=Overall_Score, group=Region, color=GovernmentEffectiveness, text=Country_Name)) +
  geom_point(alpha = 0.9, size = 3) +
  facet_wrap(~ Region) +
  scale_color_gradient(low="lightblue", high="darkblue") +
  theme_minimal()+ theme(legend.position="none") + 
  labs(title = "World countries: Overall Score vs GDP per capita", subtitle="(color denotes Government Effectiveness)",
       x = "GDP per capita (log)", y = "Overall Score")

ggplotly(p, tooltip=c("Country_Name"))


  • Conditional Box plot: Income level - Overall Score

As expected, the conditional box plot for income level and overall score shows that countries with higher income levels obtain higher sustainability scores, while the lowest scores correspond to countries classified as low-income. These results suggest that as the income level increases, the overall SDG score also rises.

# Income level - Overall Score


conditional_bx_1 <- ggplot(merged_data, aes(x = Income_group, y = Overall_Score, fill = Income_group)) + 
  geom_boxplot() +
  scale_fill_manual(values = brewer.pal(length(unique(merged_data$Income_group)),  name = "Set2")) +  
   theme(legend.position="none")

conditional_bx_1


  • Conditional Box plot: Income level - GDP

Regarding income level and GDP per capita (plotted in logs), we observe the same pattern as before: higher income groups correspond to higher levels of GDP per capita, and vice versa.

# Income level - GDP
conditional_bx_2 <- ggplot(merged_data, aes(x=Income_group, y=log(GDPpercapita), fill = Income_group)) + 
  geom_boxplot() +
   scale_fill_manual(values = brewer.pal(length(unique(merged_data$Income_group)),  name = "Set2")) +  
   theme(legend.position="none")
conditional_bx_2


  • Conditional Box plot: Income level - Government Effectiveness

Lastly, the conditional distribution of government effectiveness in terms of income level reveals that higher levels of government quality are achieved in countries with higher income levels, while weaker governance is associated with low-income countries. This implies that government effectiveness increases with the income level.

# Income level - Government 
conditional_bx_3 <- ggplot(merged_data, aes(x=Income_group, y=GovernmentEffectiveness, fill = Income_group)) + 
  geom_boxplot() +
  scale_fill_manual(values = brewer.pal(length(unique(merged_data$Income_group)),  name = "Set2")) +  
   theme(legend.position="none")
conditional_bx_3



3.5 Associative Analysis

Building upon the descriptive analyses discussed in the preceding sections, we can extend the study by conducting an associative analysis. This method enables us to assess whether there are consistent and stable linkages between the levels of the variables or, in simple terms, whether relationships exist among the variables under study.

Although the primary objective of this report is to examine whether there are homogeneous groups of countries in relation to the achievement of the Sustainable Development Goals, we are also interested in the association between the socioeconomic factors and the goal scores, and in potential relationships among the 17 goals. At least a slight degree of association between the variables is needed in order to classify countries into homogeneous clusters.


Correlation analysis

Our objective is to assess the stability and significance of the relationships among our variables (between socioeconomic variables and goals, as well as among the goals themselves). First, we identify whether such relationships are present and, if so, we analyse their direction and strength. To do so, we compute the correlation matrix for the quantitative variables of our dataset.


  • Correlation matrix:
data <- data.frame(merged_data[1:152, c(3,5,8,9,10,11,12,13,14,15,16,17,18)])

# correlation matrix
cor_matrix <- round(cor(data),2)
cor_matrix
##                         GovernmentEffectiveness log_GDP Overall_Score
## GovernmentEffectiveness                    1.00    0.83          0.76
## log_GDP                                    0.83    1.00          0.79
## Overall_Score                              0.76    0.79          1.00
## Goal_1_Score                               0.63    0.77          0.86
## Goal_2_Score                               0.58    0.50          0.67
## Goal_3_Score                               0.77    0.84          0.92
## Goal_4_Score                               0.68    0.73          0.86
## Goal_5_Score                               0.61    0.60          0.67
## Goal_6_Score                               0.65    0.70          0.86
## Goal_7_Score                               0.53    0.62          0.78
## Goal_8_Score                               0.75    0.70          0.75
## Goal_9_Score                               0.87    0.87          0.83
## Goal_10_Score                              0.36    0.37          0.49
##                         Goal_1_Score Goal_2_Score Goal_3_Score Goal_4_Score
## GovernmentEffectiveness         0.63         0.58         0.77         0.68
## log_GDP                         0.77         0.50         0.84         0.73
## Overall_Score                   0.86         0.67         0.92         0.86
## Goal_1_Score                    1.00         0.52         0.87         0.81
## Goal_2_Score                    0.52         1.00         0.59         0.61
## Goal_3_Score                    0.87         0.59         1.00         0.85
## Goal_4_Score                    0.81         0.61         0.85         1.00
## Goal_5_Score                    0.47         0.48         0.60         0.63
## Goal_6_Score                    0.71         0.56         0.79         0.70
## Goal_7_Score                    0.75         0.46         0.74         0.68
## Goal_8_Score                    0.57         0.60         0.70         0.63
## Goal_9_Score                    0.73         0.60         0.83         0.72
## Goal_10_Score                   0.46         0.28         0.46         0.29
##                         Goal_5_Score Goal_6_Score Goal_7_Score Goal_8_Score
## GovernmentEffectiveness         0.61         0.65         0.53         0.75
## log_GDP                         0.60         0.70         0.62         0.70
## Overall_Score                   0.67         0.86         0.78         0.75
## Goal_1_Score                    0.47         0.71         0.75         0.57
## Goal_2_Score                    0.48         0.56         0.46         0.60
## Goal_3_Score                    0.60         0.79         0.74         0.70
## Goal_4_Score                    0.63         0.70         0.68         0.63
## Goal_5_Score                    1.00         0.61         0.48         0.60
## Goal_6_Score                    0.61         1.00         0.64         0.67
## Goal_7_Score                    0.48         0.64         1.00         0.46
## Goal_8_Score                    0.60         0.67         0.46         1.00
## Goal_9_Score                    0.60         0.74         0.56         0.72
## Goal_10_Score                   0.13         0.38         0.22         0.36
##                         Goal_9_Score Goal_10_Score
## GovernmentEffectiveness         0.87          0.36
## log_GDP                         0.87          0.37
## Overall_Score                   0.83          0.49
## Goal_1_Score                    0.73          0.46
## Goal_2_Score                    0.60          0.28
## Goal_3_Score                    0.83          0.46
## Goal_4_Score                    0.72          0.29
## Goal_5_Score                    0.60          0.13
## Goal_6_Score                    0.74          0.38
## Goal_7_Score                    0.56          0.22
## Goal_8_Score                    0.72          0.36
## Goal_9_Score                    1.00          0.43
## Goal_10_Score                   0.43          1.00
# correlation matrix with p-values (rcorr() is provided by the Hmisc package)
library(Hmisc)
p_matrix <- rcorr(as.matrix(data))
p_matrix
##                         GovernmentEffectiveness log_GDP Overall_Score
## GovernmentEffectiveness                    1.00    0.83          0.76
## log_GDP                                    0.83    1.00          0.79
## Overall_Score                              0.76    0.79          1.00
## Goal_1_Score                               0.63    0.77          0.86
## Goal_2_Score                               0.58    0.50          0.67
## Goal_3_Score                               0.77    0.84          0.92
## Goal_4_Score                               0.68    0.73          0.86
## Goal_5_Score                               0.61    0.60          0.67
## Goal_6_Score                               0.65    0.70          0.86
## Goal_7_Score                               0.53    0.62          0.78
## Goal_8_Score                               0.75    0.70          0.75
## Goal_9_Score                               0.87    0.87          0.83
## Goal_10_Score                              0.36    0.37          0.49
##                         Goal_1_Score Goal_2_Score Goal_3_Score Goal_4_Score
## GovernmentEffectiveness         0.63         0.58         0.77         0.68
## log_GDP                         0.77         0.50         0.84         0.73
## Overall_Score                   0.86         0.67         0.92         0.86
## Goal_1_Score                    1.00         0.52         0.87         0.81
## Goal_2_Score                    0.52         1.00         0.59         0.61
## Goal_3_Score                    0.87         0.59         1.00         0.85
## Goal_4_Score                    0.81         0.61         0.85         1.00
## Goal_5_Score                    0.47         0.48         0.60         0.63
## Goal_6_Score                    0.71         0.56         0.79         0.70
## Goal_7_Score                    0.75         0.46         0.74         0.68
## Goal_8_Score                    0.57         0.60         0.70         0.63
## Goal_9_Score                    0.73         0.60         0.83         0.72
## Goal_10_Score                   0.46         0.28         0.46         0.29
##                         Goal_5_Score Goal_6_Score Goal_7_Score Goal_8_Score
## GovernmentEffectiveness         0.61         0.65         0.53         0.75
## log_GDP                         0.60         0.70         0.62         0.70
## Overall_Score                   0.67         0.86         0.78         0.75
## Goal_1_Score                    0.47         0.71         0.75         0.57
## Goal_2_Score                    0.48         0.56         0.46         0.60
## Goal_3_Score                    0.60         0.79         0.74         0.70
## Goal_4_Score                    0.63         0.70         0.68         0.63
## Goal_5_Score                    1.00         0.61         0.48         0.60
## Goal_6_Score                    0.61         1.00         0.64         0.67
## Goal_7_Score                    0.48         0.64         1.00         0.46
## Goal_8_Score                    0.60         0.67         0.46         1.00
## Goal_9_Score                    0.60         0.74         0.56         0.72
## Goal_10_Score                   0.13         0.38         0.22         0.36
##                         Goal_9_Score Goal_10_Score
## GovernmentEffectiveness         0.87          0.36
## log_GDP                         0.87          0.37
## Overall_Score                   0.83          0.49
## Goal_1_Score                    0.73          0.46
## Goal_2_Score                    0.60          0.28
## Goal_3_Score                    0.83          0.46
## Goal_4_Score                    0.72          0.29
## Goal_5_Score                    0.60          0.13
## Goal_6_Score                    0.74          0.38
## Goal_7_Score                    0.56          0.22
## Goal_8_Score                    0.72          0.36
## Goal_9_Score                    1.00          0.43
## Goal_10_Score                   0.43          1.00
## 
## n= 152 
## 
## 
## P
##                         GovernmentEffectiveness log_GDP Overall_Score
## GovernmentEffectiveness                         0.0000  0.0000       
## log_GDP                 0.0000                          0.0000       
## Overall_Score           0.0000                  0.0000               
## Goal_1_Score            0.0000                  0.0000  0.0000       
## Goal_2_Score            0.0000                  0.0000  0.0000       
## Goal_3_Score            0.0000                  0.0000  0.0000       
## Goal_4_Score            0.0000                  0.0000  0.0000       
## Goal_5_Score            0.0000                  0.0000  0.0000       
## Goal_6_Score            0.0000                  0.0000  0.0000       
## Goal_7_Score            0.0000                  0.0000  0.0000       
## Goal_8_Score            0.0000                  0.0000  0.0000       
## Goal_9_Score            0.0000                  0.0000  0.0000       
## Goal_10_Score           0.0000                  0.0000  0.0000       
##                         Goal_1_Score Goal_2_Score Goal_3_Score Goal_4_Score
## GovernmentEffectiveness 0.0000       0.0000       0.0000       0.0000      
## log_GDP                 0.0000       0.0000       0.0000       0.0000      
## Overall_Score           0.0000       0.0000       0.0000       0.0000      
## Goal_1_Score                         0.0000       0.0000       0.0000      
## Goal_2_Score            0.0000                    0.0000       0.0000      
## Goal_3_Score            0.0000       0.0000                    0.0000      
## Goal_4_Score            0.0000       0.0000       0.0000                   
## Goal_5_Score            0.0000       0.0000       0.0000       0.0000      
## Goal_6_Score            0.0000       0.0000       0.0000       0.0000      
## Goal_7_Score            0.0000       0.0000       0.0000       0.0000      
## Goal_8_Score            0.0000       0.0000       0.0000       0.0000      
## Goal_9_Score            0.0000       0.0000       0.0000       0.0000      
## Goal_10_Score           0.0000       0.0005       0.0000       0.0002      
##                         Goal_5_Score Goal_6_Score Goal_7_Score Goal_8_Score
## GovernmentEffectiveness 0.0000       0.0000       0.0000       0.0000      
## log_GDP                 0.0000       0.0000       0.0000       0.0000      
## Overall_Score           0.0000       0.0000       0.0000       0.0000      
## Goal_1_Score            0.0000       0.0000       0.0000       0.0000      
## Goal_2_Score            0.0000       0.0000       0.0000       0.0000      
## Goal_3_Score            0.0000       0.0000       0.0000       0.0000      
## Goal_4_Score            0.0000       0.0000       0.0000       0.0000      
## Goal_5_Score                         0.0000       0.0000       0.0000      
## Goal_6_Score            0.0000                    0.0000       0.0000      
## Goal_7_Score            0.0000       0.0000                    0.0000      
## Goal_8_Score            0.0000       0.0000       0.0000                   
## Goal_9_Score            0.0000       0.0000       0.0000       0.0000      
## Goal_10_Score           0.1107       0.0000       0.0060       0.0000      
##                         Goal_9_Score Goal_10_Score
## GovernmentEffectiveness 0.0000       0.0000       
## log_GDP                 0.0000       0.0000       
## Overall_Score           0.0000       0.0000       
## Goal_1_Score            0.0000       0.0000       
## Goal_2_Score            0.0000       0.0005       
## Goal_3_Score            0.0000       0.0000       
## Goal_4_Score            0.0000       0.0002       
## Goal_5_Score            0.0000       0.1107       
## Goal_6_Score            0.0000       0.0000       
## Goal_7_Score            0.0000       0.0060       
## Goal_8_Score            0.0000       0.0000       
## Goal_9_Score                         0.0000       
## Goal_10_Score           0.0000
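
As a rough cross-check of the p-values above, the sketch below applies the standard t-test for a Pearson correlation to two of the rounded coefficients from the matrix. The helper `cor_p_value` is a hypothetical name introduced here for illustration, and the assumption is that this test is close to what `rcorr()` reports; because the inputs are rounded to two decimals, small differences from the printed p-values are expected.

```r
# Sketch: two-sided p-value for a Pearson correlation r computed from n pairs.
# Under the null hypothesis of zero correlation,
#   t = r * sqrt((n - 2) / (1 - r^2))
# follows a t distribution with n - 2 degrees of freedom.
cor_p_value <- function(r, n) {  # hypothetical helper, not part of the report
  t_stat <- r * sqrt((n - 2) / (1 - r^2))
  2 * pt(-abs(t_stat), df = n - 2)
}

# Rounded coefficients taken from the matrix above (n = 152 countries)
p_goal5_goal10 <- cor_p_value(0.13, 152)  # Goal 5 vs Goal 10
p_goal3_gdp    <- cor_p_value(0.84, 152)  # Goal 3 vs log_GDP

print(round(c(goal5_goal10 = p_goal5_goal10, goal3_gdp = p_goal3_gdp), 4))
```

The weak Goal 5 vs Goal 10 correlation is the only pair in the matrix whose p-value exceeds the usual 5% threshold, which this calculation reproduces.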


  • Correlation plot

The correlation data can also be visualized in the following correlation plot, which reveals strong relationships between some of the variables. For instance, there is a high correlation (0.84) between SDG 3 (Good Health and Well-being) and GDP per capita, implying that a country's overall well-being is related to its level of GDP per capita. Furthermore, we observe a high correlation (0.85) between SDG 3 (Good Health and Well-being) and SDG 4 (Quality Education), suggesting that countries with good quality education also tend to have high levels of health and well-being. Additionally, the matrix reveals a strong relationship (0.74) between SDG 6 (Clean Water and Sanitation) and SDG 9 (Industry, Innovation and Infrastructure), indicating a link between access to clean water and sanitation and the development of industry and infrastructure.

library(Hmisc)
library(corrplot)
library(PerformanceAnalytics)

# Visualization of the data matrix
corrplot(cor_matrix, type = "upper", order = "hclust", 
         tl.col = "black", tl.srt = 45)


  • Matrix of scatter plots

Additionally, the matrix below shows the pairwise correlations between GDP per capita (in logs), Government Effectiveness, and Overall Score, together with each variable's histogram and the pairwise scatter plots. The three variables appear roughly normally distributed and are strongly correlated with one another. Moreover, the asterisks indicate that these relationships are statistically significant.

# Matrix of scatter plots

my_data <- data[ c('log_GDP','Overall_Score','GovernmentEffectiveness')] 
chart.Correlation(my_data, histogram=TRUE,pch="+")


4. Principal Component Analysis (PCA)

In this section we focus on Principal Component Analysis (PCA), a technique that represents multivariate data with a smaller number of variables without significant loss of information, allowing us to find hidden relationships between variables.


PCA

Performing PCA can help us identify which SDGs and other economic and political variables are most influential in explaining the variability in our data. Subsequently, we can use these components to assess the progress of each country. To perform the PCA, we extract the numeric variables from our original dataset and scale them, retaining only those corresponding to the 17 SDGs, GDP per capita (log transformation), and Government Effectiveness.

# Extract the numeric variables (scaling is applied inside prcomp below)
data = merged_data %>% dplyr::select(-c(Country_Name, Country_Code, Region, Income_group, Overall_Score, GDPpercapita))

# PCA on the standardised variables
pca = prcomp(data, scale. = TRUE)
summary(pca)
## Importance of components:
##                           PC1     PC2    PC3     PC4     PC5     PC6     PC7
## Standard deviation     3.2709 1.22182 1.1342 0.95584 0.88062 0.83087 0.76201
## Proportion of Variance 0.5631 0.07857 0.0677 0.04809 0.04082 0.03633 0.03056
## Cumulative Proportion  0.5631 0.64168 0.7094 0.75746 0.79828 0.83461 0.86518
##                            PC8     PC9    PC10    PC11   PC12    PC13    PC14
## Standard deviation     0.66613 0.63385 0.55701 0.53691 0.4988 0.46729 0.40001
## Proportion of Variance 0.02335 0.02115 0.01633 0.01517 0.0131 0.01149 0.00842
## Cumulative Proportion  0.88853 0.90968 0.92600 0.94118 0.9543 0.96577 0.97419
##                           PC15    PC16    PC17    PC18    PC19
## Standard deviation     0.37039 0.35524 0.29374 0.27668 0.25342
## Proportion of Variance 0.00722 0.00664 0.00454 0.00403 0.00338
## Cumulative Proportion  0.98141 0.98805 0.99259 0.99662 1.00000


Given the PCA results, we first focus on the standard deviation of each principal component (PC), which indicates the spread of the data along that component. As we can observe, PC1 has the highest standard deviation, suggesting that PC1 captures the most variation in the data, followed by PC2 and PC3.

In relation to the proportion of variance, which indicates how much of the total variability is captured by each component, the results show that PC1 explains the highest proportion of variance, 56.3%, followed by PC2 and PC3, whose proportions are around 7.9% and 6.8%, respectively.
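
The link between the standard deviations and the proportions of variance can be sketched as follows. The example uses the built-in `USArrests` dataset rather than the report's data, and `pca_demo` is an illustrative name; the last lines apply the same formula to the PC1 standard deviation printed above, assuming 19 standardised variables (so that the total variance is 19).

```r
# Sketch on a built-in dataset (USArrests), not the report's data
pca_demo <- prcomp(USArrests, scale. = TRUE)

# Each component's variance is its squared standard deviation;
# dividing by the total gives the proportion of variance explained
var_explained <- pca_demo$sdev^2 / sum(pca_demo$sdev^2)
cum_explained <- cumsum(var_explained)

# summary() reports the same proportions, rounded to five decimals
reported <- summary(pca_demo)$importance["Proportion of Variance", ]
print(round(var_explained, 4))
print(round(cum_explained, 4))

# Same formula applied to the PC1 standard deviation from the output above,
# assuming the total variance equals the number of standardised variables (19)
pc1_share <- 3.2709^2 / 19
print(round(pc1_share, 4))
```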


Additionally, we can visualize the previous findings in the following plot, where we observe that the first and second principal components capture 64% of the total variability. Furthermore, using the first three principal components together we can explain around 71% of the total variance.

Based on this analysis, the results suggest that the first few principal components are essential for capturing the variability in the data. Conversely, each of the remaining components explains less than 5% of the total variance.

fviz_screeplot(pca, addlabels = TRUE)


4.1 First Principal Component (PC1)

  • Contribution of variables to PC1

In this step we analyse the First Principal Component. The following bar plot shows the loading of each variable on PC1: a positive bar means that higher values of the variable are associated with higher values of PC1, and a negative bar means that higher values of the variable are associated with lower values of PC1. The plot suggests that GDP, Goal 3, Goal 9 and Goal 16 are the variables that contribute the most to PC1. Conversely, the bars for Goal 14 and Goal 15 lie close to the zero line, meaning that they have little influence on PC1 and consequently contribute little to the variability it captures. Finally, the bars corresponding to Goal 12 and Goal 13 are below zero, indicating a negative association with PC1.

barplot_pc1 <- barplot(pca$rotation[,1], las=2, col="lightblue")



Additionally, we can visualize the contribution of the variables to the First Principal Component in the chart below, where the variables located on the left are the ones that contribute the most. Goal 3, Goal 9, and GDP have the highest contributions to PC1 and are therefore the most important for explaining the variability it captures. In contrast, Goal 14 and Goal 15, located on the right side, contribute the least to PC1.

Moreover, the red dashed line, which indicates the expected average contribution of the variables, helps us identify which of them make a more significant contribution to the principal component. Variables whose contributions exceed this red line are considered more influential in explaining the variability captured by the principal component.

fviz_contrib(pca, choice = "var", axes = 1)
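
The percentages in such a contribution chart can be approximated directly from the loadings. The sketch below uses the built-in `USArrests` dataset instead of the report's data and assumes the usual definition of a variable's contribution to a component (100 times its squared loading) and a reference line at 100 divided by the number of variables; with the 19 variables used in this report, that line would sit at roughly 5.3%.

```r
# Sketch on a built-in dataset (USArrests), not the report's data
pca_demo <- prcomp(USArrests, scale. = TRUE)

# Loadings of each variable on the first component (a unit-length vector)
loadings_pc1 <- pca_demo$rotation[, 1]

# Contribution in percent: squared loading times 100 (contributions sum to 100)
contrib_pc1 <- 100 * loadings_pc1^2

# Expected average contribution if all variables contributed equally
expected_avg <- 100 / length(loadings_pc1)

# Variables above the reference line
above_avg <- names(contrib_pc1)[contrib_pc1 > expected_avg]

print(sort(round(contrib_pc1, 1), decreasing = TRUE))
print(above_avg)
```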


  • Rank the SDG’s assessment of the countries by the PC1

Because most variables load positively on PC1, the PC1 score can be read as a summary index of overall SDG progress. Countries with higher scores can therefore be interpreted as making more progress towards achieving the SDGs, as is the case for Norway, Denmark, Sweden, Austria, Finland, New Zealand, Netherlands, Switzerland, Germany, and Ireland. Conversely, countries with lower scores can be interpreted as struggling to achieve the SDGs, such as Central African Republic, South Sudan, Chad, Somalia, Afghanistan, Sudan, Liberia, Haiti, Madagascar and Niger.

Country = merged_data$Country_Name
Region = merged_data$Region
Income_group = merged_data$Income_group


# Ten countries at each end of PC1, matching the lists in the text
low_progress <- Country[order(pca$x[,1])][1:10]
high_progress <- Country[order(pca$x[,1], decreasing = TRUE)][1:10]


4.2 Second Principal Component (PC2)

  • Contribution of variables to second component

Next, we examine the Second Principal Component. In the following bar plot we can visualize the contribution of each variable to the Second Principal Component, helping to understand the patterns in the data captured by that component.

In this sense, PC2 is negatively associated with two of the variables, SDG 14 (Life Below Water) and SDG 15 (Life on Land), implying that higher scores for the sustainable use of the oceans and terrestrial ecosystems are associated with lower values of PC2. The contribution of the remaining variables to PC2 is relatively low, given their proximity to the zero line.

barplot(pca$rotation[,2], las=2, col="darkblue")



Alternatively, we can visualize the contribution of each variable to the Second Principal Component in the plot below. In line with our previous conclusions, the variables on the left side that exceed the red dashed line are SDG 14 (Life Below Water) and SDG 15 (Life on Land), implying that they are the ones that contribute the most to PC2.

fviz_contrib(pca, choice = "var", axes = 2)


  • Rank the SDG’s assessment of the countries by the PC2

To gain more insight into PC2, we rank the countries using this component and interpret their positions in relation to the two influential variables identified (Life Below Water and Life on Land). Because both goals load negatively on PC2, countries with the lowest PC2 scores, such as Namibia and Cuba, are the ones most strongly associated with high scores on these two environmental SDGs, while countries with the highest PC2 scores, such as Singapore or Bahrain, are the least associated with them.

low_contribution  <- Country[order(pca$x[,2])][1:5] 
print(low_contribution )
## [1] "Namibia"  "Cuba"     "Suriname" "Finland"  "Estonia"
high_contribution <- Country[order(pca$x[,2], decreasing=T)][1:5]
print(high_contribution)
## [1] "Singapore" "Mauritius" "Bahrain"   "Israel"    "Guyana"
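
When reading a ranking like this one, the sign of the loadings matters: a variable with a negative loading on a component is negatively correlated with that component's scores, so the units with the lowest scores tend to have the highest values of that variable. The sketch below illustrates this on the built-in `USArrests` dataset (not the report's data); all object names are illustrative.

```r
# Sketch on a built-in dataset (USArrests), not the report's data
pca_demo <- prcomp(USArrests, scale. = TRUE)

# Correlation between each standardised variable and each component's scores
var_pc_cor <- cor(scale(USArrests), pca_demo$x)

# For standardised data this equals loading * component standard deviation,
# so every correlation has the same sign as the corresponding loading
rebuilt <- sweep(pca_demo$rotation, 2, pca_demo$sdev, "*")

# Units (here: US states) at the two ends of the second component
lowest_pc2  <- rownames(USArrests)[order(pca_demo$x[, 2])][1:3]
highest_pc2 <- rownames(USArrests)[order(pca_demo$x[, 2], decreasing = TRUE)][1:3]

print(round(var_pc_cor[, 1:2], 2))
print(lowest_pc2)
print(highest_pc2)
```

Note that the overall sign of a principal component is arbitrary, so the interpretation should always be based on the loadings rather than on whether scores are "high" or "low" in absolute terms.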


4.3 Conclusions from PCA


  • Visualization of PC1 and PC2 by region

After analysing the First and Second Principal Components, which together explain 64% of the variability in our data, we plot the scores on both components for each country. Countries grouped together on the plot have very similar scores in PC1 and PC2, suggesting that they have similar patterns in their data and, consequently, similar progress in the assessment of the Sustainable Development Goals. In addition, the color of the points indicates the region to which the countries belong, allowing us to identify potential similarities across countries belonging to the same region.

Accordingly, we observe similar patterns among European countries, since all of them are located together, implying that their PC1 and PC2 scores are quite close. The same applies to most of the Sub-Saharan Africa and Latin America & Caribbean regions, whose countries are plotted together in the left and middle of the chart, respectively.

Conversely, countries belonging to the remaining regions (East Asia & Pacific, Middle East & North Africa, North America and South Asia) do not show similarities, as they are scattered across the chart. In this sense, the United States, Japan, and Australia show patterns similar to those of European countries, despite belonging to other geographical regions. Likewise, the scores of Haiti and Pakistan resemble the patterns of countries in Sub-Saharan Africa, despite being geographically distant.

data.frame(z1=pca$x[,1],z2=pca$x[,2]) %>% 
  ggplot(aes(z1,z2,label = Country,color = Region)) + geom_point(size=0) +
  labs(title="PC1 and PC2 scores", x="PC1", y="PC2") + 
  guides(color=guide_legend(title = "Region")) +
  theme_bw() + 
  theme(legend.position="bottom") + 
  geom_text(size=3, hjust=0.6, vjust=0, check_overlap = TRUE) 


  • Visualization of PC1 and PC2 by income level

Furthermore, it might be interesting to visualize the scores for PC1 and PC2 for each country, but distinguish between income groups instead of geographical regions. This can provide insights into how economic factors contribute to the positioning of countries in the plot.

In this way, we can observe four differentiated groups of countries, which coincide with the four income level classifications. This result suggests that income level is a stronger determinant than geographical location of the degree of achievement of the Sustainable Development Goals.

data.frame(z1=pca$x[,1],z2=pca$x[,2]) %>% 
  ggplot(aes(z1,z2,label = Country,color = Income_group)) + geom_point(size=0) +
  labs(title="PC1 and PC2 scores", x="PC1", y="PC2") + 
  guides(color = guide_legend(title="Income Group")) +
  theme_bw() + 
  theme(legend.position="bottom") + 
  geom_text(size=3, hjust=0.6, vjust=0, check_overlap = TRUE) 


The table below confirms that the countries that have achieved the greatest sustainable development, as measured by their mean PC1 score, are those in the high-income group.

data.frame(z1=pca$x[,1],Income_group)  %>% 
  group_by(Income_group)  %>% 
  summarise(mean = mean(z1), n=n()) %>% 
  arrange(desc(mean))
## # A tibble: 4 × 3
##   Income_group          mean     n
##   <fct>                <dbl> <int>
## 1 High income          3.54     49
## 2 Upper middle income  0.421    41
## 3 Lower middle income -2.14     42
## 4 Low income          -5.04     20


  • Map

Lastly, we can visualize the PC1 scores on a world map, where darker areas represent higher PC1 scores and lighter shades represent lower PC1 scores. This allows us to identify clusters of countries with similar scores, which may indicate similarities in the data in terms of the assessment of the Sustainable Development Goals.

map = data.frame(country = Country, value=pca$x[,1])

map$country = countrycode(map$country, 'country.name', 'iso3c')

matched <- joinCountryData2Map(map, joinCode = "ISO3",
                               nameJoinColumn = "country")
## 152 codes from your data successfully matched countries in the map
## 0 codes from your data failed to match with a country code in the map
## 91 codes from the map weren't represented in your data
# Other palette options: white2Black, black2White, palette, topo, terrain, rainbow, negpos8, negpos9
mapCountryData(matched, nameColumnToPlot = "value", missingCountryCol = "white",
               addLegend = TRUE, borderCol = "#C7D9FF",
               catMethod = "pretty", colourPalette = "heat",
               mapTitle = "PC1 by Country", lwd = 1)


According to the results, distinct patterns emerge. Countries shaded in red and dark orange perform better on SDG achievement; they are predominantly situated in Europe, North America and Australia and correspond to higher income levels. Conversely, countries shaded in yellow and light orange, primarily situated in Sub-Saharan Africa and South Asia, struggle to achieve the SDGs. Countries in Latin America and East Asia, represented by the orange colour in the middle of the scale, demonstrate moderate progress. Moreover, neighbouring countries tend to share similar shades, which suggests that SDG performance closely follows the geographic distribution of countries.


5. Clustering tools and interpretation

After conducting a Principal Component Analysis (PCA) to reduce the dimensionality of our dataset and uncover underlying patterns, our next step involves applying clustering techniques. Clustering is a fundamental tool in unsupervised learning that groups similar observations together based on certain characteristics, allowing us to identify subgroups within our data. In this section, we use clustering algorithms to explore the structure of our data and gain insights into distinct clusters of countries in relation to the attainment of the Sustainable Development Goals.


5.1 Choice of number of clusters

In clustering, selecting the appropriate number of clusters is one of the most important steps. Therefore, we first try various initial guesses to determine the most suitable number of clusters for our dataset.
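
One common way to compare candidate numbers of clusters is the elbow method: run K-means for several values of k and look for the point where the total within-cluster sum of squares stops dropping sharply. The sketch below illustrates the idea on the built-in `USArrests` dataset (not the report's data); the seed, the range of k and the number of random starts are arbitrary choices made for this illustration.

```r
# Sketch on a built-in dataset (USArrests), not the report's data
set.seed(123)  # arbitrary seed so the random starts are reproducible

X <- scale(USArrests)  # standardise so all variables carry equal weight
k_values <- 1:8

# Total within-cluster sum of squares for each candidate k
wss <- sapply(k_values, function(k) {
  kmeans(X, centers = k, nstart = 25)$tot.withinss
})

# Plotting wss against k_values gives the usual elbow plot
print(data.frame(k = k_values, wss = round(wss, 1)))
```

The factoextra package loaded at the start of this report offers `fviz_nbclust()` for a similar visual comparison.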


K-means clustering with 4 clusters

After performing K-means clustering with 4 clusters, we obtain cluster sizes of 42, 38, 42, and 30, indicating the number of countries grouped into each cluster. To analyse the distinct clusters, we can observe their centroids, which represent the mean values of the variables within each cluster. These results allow us to observe how each cluster exhibits different average scores on the Sustainable Development Goals (SDGs):

  • Cluster 1: Countries in this cluster demonstrate moderate to high scores across most SDGs, with notable strengths in Goal 1 (No Poverty, 90.5), Goal 12 (Responsible Consumption and Production, 87.2) and Goal 13 (Climate Action, 85.9). However, they show a relatively low score in Goal 9 (Industry, Innovation, and Infrastructure, 40.7), indicating room for improvement in industrial development and innovation.

  • Cluster 2: This cluster consists of countries with the lowest scores across most SDGs. They exhibit particularly low scores in Goal 9 (16.2), Goal 1 (26.6), Goal 3 (Good Health and Well-being, 41.3) and Goal 4 (Quality Education, 41.6), together with the lowest average government effectiveness and log GDP. In contrast, their scores in Goal 12 (95.7) and Goal 13 (97.9) are the highest of the four clusters.

  • Cluster 3: This cluster represents countries with the highest scores across most SDGs, particularly excelling in Goal 1 (99.4), Goal 4 (96.2) and Goal 3 (90.9), and it is the only cluster with a high score in Goal 9 (81.3). However, these countries exhibit the lowest scores in Goal 12 (65.9) and Goal 13 (52.9), indicating a need for more sustainable consumption patterns and efforts to combat climate change.

  • Cluster 4: Countries in this cluster show a profile very close to Cluster 1 on most goals, for example Goal 1 (82.5) and Goal 4 (83.3). The main differences are a much lower score in Goal 10 (Reduced Inequalities, 26.4 versus 75.8) and a higher score in Goal 5 (Gender Equality, 67.6 versus 57.8), indicating that reducing inequalities is the main area for improvement.

clustering_4 = kmeans(data, centers = 4, nstart=1000)
clustering_4
## K-means clustering with 4 clusters of sizes 42, 38, 42, 30
## 
## Cluster means:
##   GovernmentEffectiveness   log_GDP Goal_1_Score Goal_2_Score Goal_3_Score
## 1              -0.2106788  8.665975     90.52854     60.07368     73.59451
## 2              -0.9749605  7.073414     26.59214     50.93251     41.30747
## 3               1.0824422 10.606880     99.40765     66.75948     90.85321
## 4              -0.2181050  8.832832     82.49123     58.90180     71.17142
##   Goal_4_Score Goal_5_Score Goal_6_Score Goal_7_Score Goal_8_Score Goal_9_Score
## 1     83.10379     57.82757     68.70789     72.78115     66.55840     40.74573
## 2     41.55357     47.69078     50.48503     42.83730     57.81520     16.23028
## 3     96.23095     76.42938     81.08382     75.78783     79.21510     81.33778
## 4     83.33559     67.62934     70.33072     72.88301     66.12517     43.15647
##   Goal_10_Score Goal_11_Score Goal_12_Score Goal_13_Score Goal_14_Score
## 1      75.83504      71.32693      87.15251      85.89809      61.32298
## 2      47.41955      47.50482      95.70163      97.90572      66.63736
## 3      85.40417      86.06820      65.86111      52.85738      62.91336
## 4      26.41953      77.07228      88.40605      85.50198      67.69310
##   Goal_15_Score Goal_16_Score Goal_17_Score
## 1      61.59968      69.25757      61.91748
## 2      64.20708      51.18080      50.87273
## 3      73.22698      81.75931      63.08022
## 4      62.50322      63.42087      62.07891
## 
## Clustering vector:
##   [1] 2 1 1 2 4 1 3 3 1 1 1 1 3 3 4 2 1 4 1 4 4 3 4 2 2 1 2 3 2 2 4 4 4 4 2 3 1
##  [38] 3 3 2 4 4 1 3 2 2 1 3 3 1 1 3 4 3 4 2 1 2 4 3 3 4 1 1 3 3 3 4 3 1 1 2 3 1
##  [75] 1 3 1 2 2 3 3 2 2 4 1 2 3 2 1 4 1 1 1 1 2 1 4 1 3 3 4 2 2 1 3 4 2 4 2 4 4
## [112] 4 3 3 3 1 3 2 2 1 2 1 2 3 3 3 2 4 2 3 1 2 4 3 3 2 1 2 1 2 3 1 4 2 1 3 3 3
## [149] 1 1 2 4
## 
## Within cluster sum of squares by cluster:
## [1]  87773.59 110865.46  86476.61  68532.30
##  (between_SS / total_SS =  63.3 %)
## 
## Available components:
## 
## [1] "cluster"      "centers"      "totss"        "withinss"     "tot.withinss"
## [6] "betweenss"    "size"         "iter"         "ifault"
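
Note that the numbering of k-means clusters is arbitrary: it depends on the random starting points, so cluster 1 in one run may be cluster 3 in the next. The following is a minimal sketch, on simulated data (the matrix `toy` and its column names are hypothetical, not the SDG dataset), of how a seed and a relabelling step can keep the written interpretation aligned with the printed output.

```r
# Toy data standing in for the SDG score matrix (hypothetical values).
set.seed(123)
toy <- rbind(
  matrix(rnorm(60, mean = 30, sd = 5), ncol = 3),
  matrix(rnorm(60, mean = 60, sd = 5), ncol = 3),
  matrix(rnorm(60, mean = 90, sd = 5), ncol = 3)
)
colnames(toy) <- c("Goal_A", "Goal_B", "Goal_C")

# Fixing the seed makes the random starts, and hence the labels, repeatable.
set.seed(42)
fit <- kmeans(toy, centers = 3, nstart = 50)

# Relabel so that cluster 1 has the lowest average centroid and cluster 3 the highest.
rank_by_mean    <- rank(rowMeans(fit$centers))
ordered_cluster <- unname(rank_by_mean[fit$cluster])
ordered_centers <- fit$centers[order(rowMeans(fit$centers)), ]

table(ordered_cluster)
```

With this ordering, "cluster 1" always refers to the group with the lowest average centroid, whatever labels `kmeans()` happened to assign.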


Furthermore, the centroids can be represented in the following bar plots to visualize the different averages of the variables across clusters. These results suggest that there exists a distinct cluster formed by countries with a very good performance on the SDGs and, conversely, a cluster whose countries depict a relatively low level of SDG progress. The two remaining clusters show very similar characteristics, as they exhibit good overall performance.

centers=clustering_4$centers

barplot(centers[1,], las=2, col="darkblue")  

barplot(centers[2,], las=2, col="darkblue") 

barplot(centers[3,], las=2, col="darkblue") 

barplot(centers[4,], las=2, col="darkblue") 
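
As a quick check of what the bars represent, the sketch below recomputes the centroids by hand as within-cluster column means and compares them with `fit$centers`. The data frame `toy` and its column names are simulated stand-ins, not the report's data.

```r
# Simulated scores between 0 and 100 (hypothetical stand-in for the SDG data).
set.seed(1)
toy <- as.data.frame(matrix(runif(200, 0, 100), ncol = 4))
names(toy) <- paste0("Goal_", 1:4, "_Score")

fit <- kmeans(toy, centers = 3, nstart = 25)

# Centroids recomputed by hand: column means within each cluster.
manual_centers <- aggregate(toy, by = list(cluster = fit$cluster), FUN = mean)

# Largest absolute difference with the centroids reported by kmeans().
max_diff <- max(abs(as.matrix(manual_centers[, -1]) - fit$centers))
max_diff
```

The difference is zero up to floating-point error, which confirms that each bar is simply the average score of the countries assigned to that cluster.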




Finally, we can illustrate the four distinct groups of countries in a cluster map, where each cluster is represented with the countries inside. As we can observe, the clusters overlap in the plot. However, we can distinguish between the cluster with the lowest SDG achievement, located on the left, and the cluster with the best performance, located on the right. The two groups in the middle, where overlapping is higher, correspond to the two similar clusters with medium-high progress on the SDGs.

fviz_cluster(clustering_4, data = data, geom = c("point"),ellipse.type = 'norm', pointsize=1) +
  theme_minimal() + 
  geom_text(label=Country,hjust=0, vjust=0,size=2,check_overlap = F) +
  scale_fill_brewer(palette="Paired")



K-means clustering with 3 clusters

Next, we attempt to improve the interpretation by performing K-means clustering with 3 clusters to reduce overlapping and facilitate understanding of the results. However, it’s important to note that using 3 clusters instead of 4 doesn’t necessarily mean it’s a better choice.

According to the results shown below, we obtain 3 clusters of sizes 70, 43 and 39, which exhibit the following characteristics:

  1. Cluster 1: This cluster represents countries with moderate to high scores across most Sustainable Development Goals (SDGs). They show strengths in Goal 1 (No Poverty), Goal 3 (Good Health and Well-being), and Goal 7 (Affordable and Clean Energy), with relatively lower scores in Goal 9 (Industry, Innovation, and Infrastructure) and Goal 10 (Reduced Inequalities).

  2. Cluster 2: Countries in this cluster demonstrate high scores across most SDGs, particularly excelling in Goal 1, Goal 4 (Quality Education), and Goal 9. However, they exhibit lower scores in Goal 12 (Responsible Consumption and Production), suggesting a need for more sustainable consumption patterns.

  3. Cluster 3: This cluster comprises countries with relatively lower scores across most SDGs compared to the other clusters. They show particularly low scores in Goal 9 and Goal 10 (Reduced Inequalities), indicating challenges in industrial development and reducing inequalities.

clustering_3 = kmeans(data, centers=3, nstart=1000)  
clustering_3
## K-means clustering with 3 clusters of sizes 70, 43, 39
## 
## Cluster means:
##   GovernmentEffectiveness   log_GDP Goal_1_Score Goal_2_Score Goal_3_Score
## 1              -0.2224941  8.729759     87.56921     59.75460     72.81891
## 2               1.0606048 10.577636     99.42052     66.58298     90.60675
## 3              -0.9489487  7.110593     27.11874     50.86133     41.49274
##   Goal_4_Score Goal_5_Score Goal_6_Score Goal_7_Score Goal_8_Score Goal_9_Score
## 1     83.17859     61.42041     69.47511     73.36817     66.40066     41.83930
## 2     95.99746     76.31619     80.94542     75.49040     79.02669     80.52588
## 3     42.58386     48.68966     50.65881     42.88066     57.87245     16.60484
##   Goal_10_Score Goal_11_Score Goal_12_Score Goal_13_Score Goal_14_Score
## 1      55.35973      73.75308      87.70842      86.16175      63.96060
## 2      85.67345      85.89875      66.31418      52.85245      62.57510
## 3      46.34451      47.98935      95.49528      97.67252      66.99917
##   Goal_15_Score Goal_16_Score Goal_17_Score
## 1      61.68326      66.78235      61.87549
## 2      72.85695      81.45622      62.88104
## 3      64.79509      51.61087      51.54528
## 
## Clustering vector:
##   [1] 3 1 1 3 1 1 2 2 1 1 1 1 2 2 1 3 1 1 1 1 1 2 1 3 3 1 3 2 3 3 1 1 1 1 3 2 1
##  [38] 2 2 3 1 1 1 2 3 3 1 2 2 1 1 2 1 2 1 3 1 3 1 2 2 1 1 1 2 2 2 1 2 1 2 3 2 1
##  [75] 1 2 1 3 3 2 2 3 3 1 1 3 2 3 1 1 1 1 1 1 3 1 3 1 2 2 1 3 3 1 2 1 3 1 3 1 1
## [112] 1 2 2 2 1 2 3 3 1 3 1 3 2 2 2 3 1 3 2 1 3 1 2 2 3 1 3 1 3 2 1 1 3 1 2 2 2
## [149] 1 1 3 1
## 
## Within cluster sum of squares by cluster:
## [1] 192574.87  89460.91 118505.27
##  (between_SS / total_SS =  58.5 %)
## 
## Available components:
## 
## [1] "cluster"      "centers"      "totss"        "withinss"     "tot.withinss"
## [6] "betweenss"    "size"         "iter"         "ifault"
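
The ratio between_SS / total_SS printed above (58.5 % for 3 clusters against 63.3 % for 4) measures the share of total variability explained by the partition. Since it typically keeps rising as clusters are added, it cannot be used on its own to choose k. The sketch below, on simulated data (the matrix `toy` is hypothetical), recomputes the quantities by hand and shows the ratio for several values of k.

```r
set.seed(7)
toy <- matrix(rnorm(300), ncol = 3)

# Share of the total sum of squares explained by a k-cluster partition.
explained <- function(x, k) {
  fit <- kmeans(x, centers = k, nstart = 25)
  fit$betweenss / fit$totss
}

# totss is the sum of squared deviations from the overall column means,
# and it splits exactly into within-cluster and between-cluster parts.
fit3     <- kmeans(toy, centers = 3, nstart = 25)
total_ss <- sum(scale(toy, scale = FALSE)^2)

ratios <- sapply(2:5, function(k) explained(toy, k))
names(ratios) <- paste0("k=", 2:5)
round(ratios, 3)
```

Even for this unstructured noise the ratio grows with k, which is why the comparison between 3 and 4 clusters has to rest on interpretability and on the criteria discussed further on.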


Again, we can visualize the centroids of each cluster in bar plots, which facilitates the interpretation of the differences obtained from the results above.

centers=clustering_3$centers

barplot(centers[1,], las=2, col="darkblue") 

barplot(centers[2,], las=2, col="darkblue") 

barplot(centers[3,], las=2, col="darkblue") 



Finally, we can represent the three clusters on a cluster map, where each group is illustrated in a different color. We observe that overlapping in the plot is lower than with 4 clusters, which facilitates the interpretation of the results. Accordingly, we can differentiate three clear groups of countries: a cluster with poor SDG performance, located on the left; a cluster with moderate performance, plotted in the middle of the map; and a cluster with very good SDG performance, situated on the right side.

# clusplot
fviz_cluster(clustering_3, data = data, geom = c("point"),ellipse.type = 'norm', pointsize=1)+
  theme_minimal() + 
  geom_text(label = Country,hjust=0, vjust=0,size=2,check_overlap = F) +
  scale_fill_brewer(palette="Paired")



Methods for optimal clustering number

As previously mentioned, the number of groups is a key decision in clustering analysis. Various methods can provide guidance in determining the optimal number of clusters. Three common techniques are the within-cluster sum of squares, the average silhouette and the gap statistic. By considering the results from these methods collectively, we can gain insights into the most suitable number of clusters for our dataset.


  • Within Cluster Sums of Squares

According to this method, the optimal number of clusters is located at the point where the total within-cluster sum of squares (WCSS) starts to decrease more slowly when another cluster is added. Taking a look at the plot, we can notice how at k = 3 the decrease in WCSS begins to slow down and becomes smooth. Therefore, under this method, the suggested number of groups is 3.

fviz_nbclust(scale(data), kmeans, method = 'wss', k.max = 20, nstart = 1000)  # smooth decrease starts at k = 3: with this graph we get the hint that 3 groups might be the best
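
The curve in this plot can be reproduced by hand, which makes the "elbow" easier to quantify. The following is a minimal sketch on simulated data with three well-separated groups (the matrix `toy` is hypothetical): it stores the total within-cluster sum of squares for each k and the reduction obtained by adding one more cluster.

```r
set.seed(10)
toy <- scale(rbind(
  matrix(rnorm(100, mean = 0),  ncol = 2),
  matrix(rnorm(100, mean = 6),  ncol = 2),
  matrix(rnorm(100, mean = 12), ncol = 2)
))

# Total within-cluster sum of squares for k = 1, ..., 8.
wss <- sapply(1:8, function(k) kmeans(toy, centers = k, nstart = 25)$tot.withinss)

# Reduction in WCSS gained by moving from k - 1 to k clusters.
drop <- -diff(wss)
data.frame(k = 2:8, reduction = round(drop, 2))
```

The elbow is the value of k after which the reductions become small; in this toy example the gain from a fourth cluster is a small fraction of the gain from the third.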



  • Average Silhouette

With the average silhouette method, the optimal number of clusters is the one with the highest score, since a high average silhouette width indicates a good clustering. The plot below therefore suggests that 2 clusters might be the most suitable number of groups for our data.

fviz_nbclust(scale(data), kmeans, method = 'silhouette', k.max = 20, nstart = 1000) # with this criterion, the higher the better: here the suggested optimum is 2 groups
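
For reference, the average silhouette width can also be computed directly with `silhouette()` from the `cluster` package. The sketch below uses simulated data with two clear groups (the matrix `toy` is hypothetical) and selects the k with the highest average width.

```r
library(cluster)  # recommended package shipped with standard R installations

set.seed(11)
toy <- scale(rbind(
  matrix(rnorm(100, mean = 0), ncol = 2),
  matrix(rnorm(100, mean = 8), ncol = 2)
))
d <- dist(toy)

# Average silhouette width for k = 2, ..., 6.
avg_sil <- sapply(2:6, function(k) {
  cl <- kmeans(toy, centers = k, nstart = 25)$cluster
  mean(silhouette(cl, d)[, "sil_width"])
})
names(avg_sil) <- 2:6

best_k <- as.integer(names(which.max(avg_sil)))
best_k
```

Silhouette widths lie between -1 and 1; values close to 1 mean that observations are much closer to their own cluster than to the nearest alternative.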



  • Gap statistics

Under the gap statistic method, the optimal number of clusters is the point where the gap statistic first reaches a peak. Accordingly, the optimal number of clusters suggested is 8, as the maximum gap statistic is reached at k = 8.

fviz_nbclust(scale(data), kmeans, method = 'gap_stat', k.max = 10, nstart = 100, nboot = 500)   
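
The suggested k depends on the rule used to read the gap curve. As an illustration, the sketch below calls `clusGap()` and `maxSE()` from the `cluster` package directly on simulated data (the matrix `toy` is hypothetical) and compares two rules: "firstSEmax", which to our understanding is the default used by the factoextra plot, and "globalmax", which simply takes the largest gap.

```r
library(cluster)

set.seed(12)
toy <- scale(rbind(
  matrix(rnorm(60, mean = 0), ncol = 2),
  matrix(rnorm(60, mean = 8), ncol = 2)
))

# Gap statistic for k = 1, ..., 6 with 50 reference data sets.
gap <- clusGap(toy, FUNcluster = kmeans, K.max = 6, B = 50,
               nstart = 10, verbose = FALSE)
gap_vals <- gap$Tab[, "gap"]
se_vals  <- gap$Tab[, "SE.sim"]

# Two ways of choosing k from the same curve.
k_first_se  <- maxSE(gap_vals, se_vals, method = "firstSEmax")
k_globalmax <- maxSE(gap_vals, se_vals, method = "globalmax")
c(firstSEmax = k_first_se, globalmax = k_globalmax)
```

The first rule favours the smallest k whose gap is within one standard error of the first local maximum, so it never returns a larger k than the global maximum does. When the two rules disagree, the more parsimonious answer is usually worth reporting alongside the maximum.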


In relation to the hints provided by the three methods, it seems suitable to set a number of clusters between 2 and 8 to group our observations. Furthermore, we are interested in exploring whether the groups of countries with similar attainment of the Sustainable Development Goals also converge in terms of their income level classification. We therefore determine that the most appropriate number of groups for our analysis might be 4, which falls between the optimal values suggested by the three methods and also allows us to check whether the four income levels are reflected in the four groups.

fit.kmeans_4 = kmeans(data, centers=4, nstart=1000) 


  • Map of the clustering: k = 4

map_4 = data.frame(country = Country, value=fit.kmeans_4$cluster)


map_4$country = countrycode(map_4$country, 'country.name', 'iso3c')


matched <- joinCountryData2Map(map_4, joinCode = "ISO3",
                               nameJoinColumn = "country")
## 152 codes from your data successfully matched countries in the map
## 0 codes from your data failed to match with a country code in the map
## 91 codes from the map weren't represented in your data
mapCountryData(matched,nameColumnToPlot="value",missingCountryCol = "white",
               borderCol = "#C7D9FF",
               catMethod = "pretty", colourPalette = "heat",
               mapTitle = c("Clusters"), lwd=1)


Finally, since we are interested not only in clustering countries based on their performance on the Sustainable Development Goals but also in analysing whether they share similar characteristics in terms of income level, we decide that the most suitable number of groups for our analysis is 4. As observed in the map above, this number of clusters leads to 4 differentiated groups of countries in terms of SDG performance, which also allows us to differentiate the four income levels within these groups.
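
One way to check the correspondence between clusters and income levels is a contingency table. The following is a purely illustrative sketch: the vectors `cluster_id` and `income_group`, the income labels and all values are hypothetical and do not come from the report's data.

```r
# Hypothetical example: 12 countries with a cluster label and an income group each.
cluster_id   <- c(1, 1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 4)
income_group <- factor(
  c("Low", "Low", "Lower-middle",
    "Lower-middle", "Lower-middle", "Upper-middle",
    "Upper-middle", "Upper-middle", "High",
    "High", "High", "High"),
  levels = c("Low", "Lower-middle", "Upper-middle", "High")
)

# Counts of countries per cluster and income group.
cross_tab <- table(cluster = cluster_id, income = income_group)

# Share of each income group within every cluster (rows sum to 1).
row_shares <- prop.table(cross_tab, margin = 1)
round(row_shares, 2)
```

If clusters and income levels converge, each row concentrates most of its mass in a single column; rows spread evenly across columns would indicate that income level does not separate the clusters.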

The results reveal clear trends: countries shaded in red exhibit stronger performance in SDG achievement and are largely clustered in Europe, North America, Australia, and New Zealand, reflecting their higher income levels. We can also notice a cluster formed by regions in the north of Asia and the north of Africa, as well as some Pacific islands, depicting similar levels of sustainable development. Conversely, countries shaded in yellow and light orange face challenges in meeting the SDGs and are mainly located in Sub-Saharan Africa and Latin America. These patterns align with countries having lower income levels compared to those in more developed regions.


5.2 Hierarchical clustering

In this section we focus on hierarchical clustering, which is another technique to group similar observations into clusters. Its main characteristic is that it organizes the data in a hierarchical structure. One of its key advantages is therefore the ability to reveal hierarchical relationships within the data, allowing for a flexible and intuitive exploration of cluster structures.


  • Distance and linkage

Unlike k-means, hierarchical clustering does not require the number of clusters to be specified in advance. Instead, the important decisions in this method are the distance between observations and the linkage used to join groups.

Distance metrics measure the dissimilarity or similarity between data points, while linkage criteria determine how clusters are merged or split at each step of the algorithm.

d = dist(scale(data), method = "euclidean") 
hc <- hclust(d, method = "ward.D2") 
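
To illustrate how much the linkage choice matters, the sketch below applies four linkage criteria to the same Euclidean distance matrix and compares the sizes of the groups obtained when each tree is cut into 4 clusters. It uses the built-in `USArrests` data as a stand-in, since the report's `data` object is not reproduced here.

```r
# Built-in data set used as a stand-in for the country score matrix.
x <- scale(USArrests)
d <- dist(x, method = "euclidean")

# Same distances, different rules for merging clusters.
linkages <- c("single", "complete", "average", "ward.D2")
trees <- lapply(linkages, function(m) hclust(d, method = m))
names(trees) <- linkages

# Cluster sizes when each tree is cut into 4 groups.
sizes <- lapply(trees, function(tr) as.vector(table(cutree(tr, k = 4))))
sizes
```

Single linkage often produces one large group plus a few isolated observations, whereas Ward's criterion tends to give more balanced groups, which is one reason it is a common choice for this kind of analysis.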


5.3 Visualization

There are several methods to visualize the hierarchical clustering, facilitating the interpretation of the results. In this section, we will use classical dendrograms, phylogenic trees, geographical maps, and heatmaps.


  • Classical dendrogram

As mentioned before, hierarchical clustering organizes the countries in a hierarchical structure, which can be visualized in a dendrogram. This dendrogram visually represents the relationships between observations, showing how they cluster together based on their similarities.

hc$labels <- Country

dend_plot <- fviz_dend(x = hc,
                       k = 4,
                       palette = "jco",
                       rect = TRUE, rect_fill = TRUE, cex = 0.5,
                       rect_border = "jco")
dend_plot


In the dendrogram above, distinguishing the countries is quite difficult. However, two major hierarchical clusters can be noticed. On the one hand, there is a group corresponding to low SDG performance on the left, from which two new clusters emerge, which in turn split into lower-level clusters. On the other hand, there is a cluster on the right, including countries with better performance. From this cluster, two new clusters emerge, dividing observations into a group with very good attainment and another with modest performance. Within the latter, more clusters of lower hierarchical levels form, further dividing the countries with lower performance.


To improve the visualization of the branches at lower hierarchical levels, we can plot the subtrees separately, which allows us to identify the countries within each cluster at lower levels.

dend_data <- attr(dend_plot, "dendrogram") 
dend_cuts <- cut(dend_data, h = 40)

# Left subtree
fviz_dend(dend_cuts$lower[[1]], main = "Subtree 1")

# Right subtree
fviz_dend(dend_cuts$lower[[2]], main = "Subtree 2")
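
The cut height determines how many subtrees are obtained, so it has to be chosen from the merge heights of the tree at hand. The sketch below shows the same idea with base R only, on the built-in `USArrests` data (a stand-in for the report's data), placing the cut between the two highest merges so that exactly two subtrees result.

```r
hc_toy <- hclust(dist(scale(USArrests)), method = "ward.D2")
dend   <- as.dendrogram(hc_toy)

# Cut halfway between the two highest merge heights.
top_heights <- sort(hc_toy$height, decreasing = TRUE)[1:2]
h_cut <- mean(top_heights)
parts <- cut(dend, h = h_cut)

# Number of subtrees below the cut and number of observations in each.
n_subtrees   <- length(parts$lower)
subtree_size <- sapply(parts$lower, function(b) attr(b, "members"))
subtree_size
```

Lowering `h_cut` below the second-highest merge would return three subtrees, and so on down the tree.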


  • Phylogenic tree

To facilitate the visualization, we represent the data using a phylogenic tree, where the results can be interpreted more easily than in the previous dendrogram.

clusters <- cutree(hc, k = 4)
fviz_dend(x = hc,
          k = 4,
          color_labels_by_k = TRUE,
          cex = 0.8,
          k_color_palette = "jco",
          type = "phylogenic",
          repel = TRUE) +
  labs(title="Socio-economic-health tree clustering of the world") + 
  theme(axis.text.x=element_blank(),axis.text.y=element_blank())


  • Geographical map

Given that the observations in our data are countries, it is also interesting to represent the hierarchical clustering results on a map, where each country is colored according to the cluster it belongs to.

groups.hc = cutree(hc, k = 4)

map = data.frame(country = Country, value = groups.hc)


map$country = countrycode(map$country, 'country.name', 'iso3c')

matched <- joinCountryData2Map(map, joinCode = "ISO3",
                               nameJoinColumn = "country")
## 152 codes from your data successfully matched countries in the map
## 0 codes from your data failed to match with a country code in the map
## 91 codes from the map weren't represented in your data
mapCountryData(matched,nameColumnToPlot="value",missingCountryCol = "white",
               borderCol = "#C7D9FF",
               catMethod = "pretty", colourPalette = "terrane",
               mapTitle = c("Clusters"), lwd=1)


According to the patterns displayed on the map, we observe some slight differences between the results obtained from hierarchical clustering and the k-means approach. These differences can be attributed to the distinct algorithms and methodologies used by each clustering method: hierarchical clustering groups data into a tree-like structure based on similarity, which may result in different clusters compared to k-means.

In this context, we observe that countries shaded in light colors primarily correspond to regions in Latin America, Africa, and parts of Asia, consistent with the findings obtained from the k-means clustering. However, a new pattern emerges in South Asia, where countries appear as a distinct cluster, exhibiting relatively higher levels of sustainable development, followed by the usual developed regions, such as Europe, North America and Australia. These distinctions underscore the importance of considering various clustering methods to gain a comprehensive understanding of the underlying patterns within the data.
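
The agreement between the two partitions can be quantified with a cross-tabulation. The sketch below does this on the built-in `USArrests` data (a stand-in, not the report's data), applying both methods to the same scaled matrix. In this report k-means was run on `data` and hierarchical clustering on `scale(data)`, so part of the observed difference may come from the scaling rather than from the algorithms themselves.

```r
x <- scale(USArrests)

set.seed(99)
km        <- kmeans(x, centers = 4, nstart = 50)$cluster
hc_groups <- cutree(hclust(dist(x), method = "ward.D2"), k = 4)

# Rows: k-means clusters; columns: hierarchical clusters.
agreement <- table(kmeans = km, hierarchical = hc_groups)
agreement

# Share of observations falling in the best-matching cell of each k-means cluster.
purity <- sum(apply(agreement, 1, max)) / length(km)
purity
```

A purity close to 1 means the two methods recover essentially the same groups under different labels; lower values point to observations that change group depending on the method.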


  • Heatmap

Finally, we can use a heatmap, also known as a false color map, which visualizes hierarchical clustering using a color scale and includes one dendrogram on the left side and another on top of the plot.

heatmap(scale(data),
        scale = "none", 
        labRow = Country,
        col = bluered(100), 
        distfun = function(x){dist(x, method = "euclidean")},
        hclustfun = function(x){hclust(x, method = "ward.D2")},
        main = "Heatmap of Data",
        xlab = "Sustainability Goals", ylab = "Countries",
        cexRow = 0.7, 
        margins = c(7,7))


To interpret the hierarchical clustering patterns illustrated in the heatmap, we need to observe the similarities and dissimilarities among both the countries (displayed in rows) and the sustainability goals (represented in columns). In this context, red indicates higher values and blue indicates lower values, helping us to identify which countries and goals have higher or lower scores relative to the rest. Moreover, the left dendrogram indicates how the countries are clustered based on their similarities in achieving the sustainability goals, while the top dendrogram shows how the sustainability goals are clustered based on the similarities in their patterns across countries. Therefore, countries that are closer together on the dendrogram share more similar profiles in terms of their performance across the goals. This is the case of the Netherlands, Australia, Hungary and Portugal, which are located together and are represented by red colors for most of the goals, indicating a good performance in the assessment of the SDGs. Conversely, we observe similarities among Mali, Zambia, Rwanda, and Tanzania, which are clustered together and shaded in blue for most of the SDGs, indicating weak progress. We can also notice a cluster formed by countries with moderate progress, shaded with both blue and red colors, such as Bolivia, Lebanon and Mauritius.
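
Because the heatmap is drawn on `scale(data)`, the colors encode standardized values rather than raw scores. The short sketch below, on the built-in `USArrests` data (a stand-in for the report's data), shows what this standardization does to each column.

```r
x <- scale(USArrests)

# After scaling, every column has mean 0 and standard deviation 1.
col_means <- colMeans(x)
col_sds   <- apply(x, 2, sd)

round(col_means, 10)
round(col_sds, 10)
```

A red cell therefore means "above the cross-country average for that variable", not a high score in absolute terms, which should be kept in mind when comparing colors across goals.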


6. Final conclusions

The analysis conducted in this study provides a comprehensive understanding of Sustainable Development Goals (SDGs) performance across countries. It relies on Principal Component Analysis (PCA) and clustering techniques, and links sustainable performance with the economic and governance quality of the countries. The results indicate that income level plays a more significant role than geographical location in determining the degree of achievement of the Sustainable Development Goals.

Through PCA, we identified the key variables contributing to the variability in SDG performance. The analysis revealed that variables such as GDP, Government Effectiveness, and specific SDGs like Goal 3 (Good Health and Well-being) and Goal 16 (Peace, Justice, and Strong Institutions) have a significant impact on the overall variability in SDG performance. Conversely, the sustainable use of the oceans and of terrestrial ecosystems are the variables that contribute the least. By visualizing the scores on PC1 and PC2, we observed distinct clusters of countries with similar patterns of SDG performance, allowing for deeper insights into regional and income-level disparities. These results support the idea that countries which converge in terms of SDG performance also have similar characteristics regarding economic and political structure.

Subsequently, K-means clustering was employed to group countries based on their SDG performance. After exploring different numbers of clusters, we found that clustering into 4 groups provided clear distinctions between countries with high, moderate, and low SDG performance. Hierarchical clustering further supported some of these findings, revealing hierarchical relationships between clusters and highlighting the influence of income level on SDG attainment. However, new patterns emerged for some specific clusters. These differences between the results of k-means clustering and the hierarchical approach show the importance of considering various clustering methods to gain a deeper understanding of the underlying patterns within our data.

Overall, the analysis provided valuable insights into the complex topic of achieving global sustainable development, enabling policymakers and politicians to identify priority areas for intervention to address the challenges in SDG performance. At the same time, the analysis pointed out the importance of considering the economic and political structure of each country when addressing sustainable development: areas with higher income levels and more efficient governance tend to succeed in achieving the SDGs, while countries with weaker economies and governance quality tend to struggle with sustainable development. Accordingly, by using data approaches like PCA and clustering, we can work towards achieving the goal of sustainable development for all nations.